Abstract: We assume data independently sampled from a mixture distribution on the
unit ball of $\mathbb{R}^D$ with $K + 1$ components: the first component is a uniform
distribution on that ball representing outliers and the other $K$ components are uniform
distributions along $K$ $d$-dimensional linear subspaces restricted to that ball. We study
both the simultaneous recovery of all $K$ underlying subspaces and the recovery of the
best $\ell_0$ subspace (i.e., the one with the largest number of points) by minimizing the
$\ell_p$-averaged distances of data points from $d$-dimensional subspaces of $\mathbb{R}^D$.
Unlike other $\ell_p$ minimization problems, this minimization is non-convex for all $p > 0$
and thus requires different methods for its analysis. We show that if $0 < p \le 1$, then
both all underlying subspaces and the best $\ell_0$ subspace can be precisely recovered by
$\ell_p$ minimization with overwhelming probability. This result extends to additive
homoscedastic uniform noise around the subspaces (i.e., uniform distribution in a strip
around each of them) and near recovery with an error proportional to the noise level. On
the other hand, if $K > 1$ and $p > 1$, then we show that both all underlying subspaces and
the best $\ell_0$ subspace cannot be recovered, or even nearly recovered. The uniform
distributions of both subspaces and outliers can be easily replaced by spherically
symmetric distributions, e.g., Gaussian distributions. Similarly, the uniform noise
assumption can be relaxed by assuming a distribution from a parametric family with the
expectation of the absolute value approaching zero when its parameter approaches zero.
The results of this paper partially justify recent effective algorithms for modeling data
by mixtures of multiple subspaces and are also used to discuss the effect of using variants
of $\ell_p$ minimization in RANSAC-type strategies for single subspace recovery.
1. Introduction

The most common tool in high-dimensional data analysis has been Principal Component
Analysis (PCA), which approximates a given data set by a low-dimensional affine
subspace. More recent works extend PCA to approximation by several subspaces. However,
many popular methods for such modeling problems are not robust to outliers.
Furthermore, methods whose robustness has been numerically demonstrated for special
cases often lack theoretical guarantees. In particular, robustness to outliers of any
strategy for recovering multiple subspaces has not yet been proved beyond some
experimental evidence.

In practice, some of the most successful methods for recovery of subspaces use
the $\ell_1$ distance. In the context of $K$ subspaces, the $\ell_p$ distance of a data set
from the mixture of those subspaces is obtained by first computing, for each data point,
its minimal Euclidean distance to the $K$ subspaces and then taking the $\ell_p$-averaged
sum over all points. Thus, one can try to recover $K$ $d$-dimensional subspaces
by minimizing the $\ell_1$ distance of the underlying data over all possible $K$ $d$-dimensional
subspaces (if $K = 1$, then we recover this way a single subspace). The robustness
of the $\ell_1$ norm has been rigorously quantified in various important settings, such as
$\ell_1$ regression and principal components pursuit. However, we are not aware of a rigorous
justification of $\ell_1$ subspace recovery even when $K = 1$. Indeed, a crucial distinction
of $\ell_1$ subspace recovery from other $\ell_1$ recovery problems is that it involves a
non-convex optimization and thus requires very different methods for its analysis.

The purpose of this paper is to explore the effectiveness of subspace recovery by
$\ell_p$ minimization for all $p > 0$ under the assumptions of uniform outliers and "uniform
sampling" along the underlying subspaces (or a strip around them). We typically refer
to this situation of uniform outliers as a point cloud. In the case of clean subspaces (i.e.,
without additive noise), we address two different questions. One question is whether it
is possible to simultaneously recover all underlying subspaces via $\ell_p$ minimization for
some values of $p > 0$. The other question is whether the best $\ell_0$ subspace, that is, the
subspace containing the largest number of points, can be recovered via $\ell_p$ minimization
for some $p > 0$. While this question is about a single subspace recovery, the underlying
model still assumes multiple subspaces and consequently the $\ell_p$ optimization is not so
trivial to analyze. If one can positively answer this question for a more general setting,
then one can prove that iterative repetition of $\ell_p$ recovery of the best $\ell_0$ subspace $K$
times results in recovery of all underlying $K$ subspaces.

We extend the solutions of these questions to the case of additive noise around the 
underlying subspaces, while allowing the recovery error to be controlled by the noise 
level. We also suggest relaxations of the uniform model described here. 

1.1. Background and Related Work 

The $\ell_1$ norm has been widely used to form robust statistics. For example, the geometric
median is the point in a data set minimizing the sum of distances from the rest of
the data points, i.e., the $\ell_1$-averaged distance. For points on the real axis, it coincides with
the usual median. Its robustness is most commonly quantified by showing that it has
a breakdown point of 0.5 (i.e., the estimator will obtain arbitrarily large values only
when the proportion of large observations is at least a half) [32]. The $\ell_1$ norm has also
been successfully applied to robust regression [24, 22, 38, 35].

Basis pursuit [11] uses $\ell_1$ minimization to search for the sparsest solutions (i.e.,
solutions minimizing the $\ell_0$ norm) of an undercomplete system of linear equations. It is used
for decomposing a signal as a linear combination of few representative elements from a
large and redundant dictionary of functions. In this application one often preprocesses
the data by normalizing the columns of the underlying matrix by their $\ell_2$ norm. Donoho
and Elad [17] have shown that "sufficiently sparse" solutions can be completely recovered
by minimizing the $\ell_1$ norm instead of the $\ell_0$ norm. However, this result restricts
the size of the mutual incoherence $M$ of the dictionary and consequently the size of the
sparse solution (which is inversely controlled by $M$). Other works [7, 16, 15, 8] show
that for the overwhelming majority of matrices representing undercomplete systems,
the minimal $\ell_1$ solution of each system coincides with the sparsest one as long as the
solution is sufficiently sparse. Moreover, this fact holds even when noise is added to
the decomposed signal (with a slight modification of the problem formulation).

Candès et al. [6] proposed and analyzed the principal component pursuit algorithm
for robust PCA, which minimizes a weighted combination of the nuclear norm and
a different $\ell_1$ norm (allowing convex optimization) among all decompositions matching
the available data. A simpler use of the $\ell_1$ norm between given data points and
representative points in a lower dimensional model (though without using the nuclear
norm term to infer this model) has appeared in several other works [3, 20, 30, 28].
Alternatively, Ding et al. [14] as well as Brooks and Dula [5] used the geometric and
non-convex $\ell_1$ recovery of a single subspace defined earlier (i.e., minimizing the $\ell_1$ sum
of Euclidean distances of data points from all possible subspaces). However, we are not
aware of any quantitative study of the robustness of this best $\ell_1$ subspace to outliers in
the setting of both multiple underlying subspaces and a point cloud (whose data points
are not necessarily far away from the subspaces).

The latter $\ell_1$ (or $\ell_p$) subspace minimization can also be applied to Hybrid Linear
Modeling (HLM), i.e., the modeling of data by mixtures of affine subspaces. This kind
of modeling finds diverse applications in many areas, such as motion segmentation in
computer vision, hybrid linear representation of images, classification of face images
and temporal segmentation of video sequences (see e.g., [46, 34]). There are already
many algorithms for HLM [25, 12, 42, 41, 4, 45, 26, 27, 23, 46, 48, 49, 34, 33, 10, 1, 50,
51]. Among these, the ones suggesting robust strategies to deal with many outliers are
RANSAC (for HLM) [49], Robust GPCA [34], SCC [10], Sparse ALC [37], MKF [50]
(or any $\ell_p$ variant of $K$-subspaces [25, 4, 45, 23] when $0 < p \le 1$) and LBF [51]. Both
MKF and LBF apply (in different ways) the $\ell_1$ subspace minimization discussed in this
paper, whereas RANSAC (for HLM) can be successfully modified utilizing such $\ell_1$
minimization (in the spirit of [43, 44], who use other norms). Sparse ALC also applies
an $\ell_1$ minimization, which is different from the one discussed here (in particular, it
involves convex optimization).

Despite the many HLM algorithms and strategies for robustness to outliers, there has
been little investigation into performance guarantees of such algorithms. Accuracy of
segmentation of HLM algorithms under some sampling assumptions is only analyzed
in [9] and [2], whereas tolerance to outliers of an HLM algorithm under some sampling
assumptions is only analyzed in [2] (in fact, [2] analyzes the more general problem of
modeling data by multiple manifolds, though it assumes an asymptotically zero noise
level, unlike [9]).

1.2. Contribution of This Paper 

This paper studies the effectiveness of recovering subspaces in point clouds by $\ell_p$
subspace minimization for different values of $0 < p < \infty$. In particular, we study the
recovery of the best $\ell_0$ subspace by the best $\ell_p$ subspace, that is, the feasibility of using
the best $\ell_p$ subspace as the best $\ell_0$ subspace. We also study full recovery of all $K$
underlying subspaces by the collection of $K$ subspaces minimizing an $\ell_p$ energy for
multiple subspaces. We restrict the discussion to linear subspaces, which we refer to as
$d$-subspaces.

We assume an underlying data set $\mathcal{X} \subset \mathbb{R}^D$ of $N$ points independently sampled
from the mixture measure defined as follows (while distinguishing between two cases
according to the presence of noise).

Definition 1.1. We say that a probability measure $\mu$ on the unit ball $B(0, 1)$ of $\mathbb{R}^D$ is a
uniform mixture measure if $\mu = \sum_{i=0}^{K} \alpha_i \mu_i$, where $\{\alpha_i\}_{i=0}^{K}$ are nonnegative numbers
summing to 1, $\mu_0$ is the uniform probability measure (i.e., scaled Lebesgue) in the unit
ball and $\{\mu_i\}_{i=1}^{K}$ are uniform probability measures along the restriction to the unit
ball of distinct $d$-subspaces of $\mathbb{R}^D$, $\{L_i\}_{i=1}^{K}$. For $\epsilon > 0$, we say that $\mu_\epsilon$ is a uniform
mixture measure with noise level $\epsilon$ if $\mu_\epsilon = \alpha_0 \mu_0 + \sum_{i=1}^{K} \alpha_i \mu_{i,\epsilon}$, where $\{\alpha_i\}_{i=0}^{K}$ and
$\mu_0$ are the same as above and $\{\mu_{i,\epsilon}\}_{i=1}^{K}$ are uniform on the cylinders
$(L_i \cap B(0, 1)) \times (L_i^{\perp} \cap B(0, \epsilon))$, $i = 1, \ldots, K$, around the $d$-subspaces $\{L_i\}_{i=1}^{K}$.
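
To make the clean (noiseless) part of Definition 1.1 concrete, the following is a minimal sampling sketch. It is only an illustration under stated assumptions: the function names `sample_unit_ball` and `sample_mixture` are hypothetical, each subspace is assumed to be given by an orthonormal basis, and the noisy measure $\mu_\epsilon$ is not covered.

```python
import math
import random

def sample_unit_ball(dim, rng):
    """Uniform point in the unit ball of R^dim: Gaussian direction, radius U^(1/dim)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    r = rng.random() ** (1.0 / dim)
    return [r * c / norm for c in g]

def sample_mixture(n, bases, alphas, D, rng):
    """Draw n points from mu = sum_i alphas[i] * mu_i (clean case only).

    bases[i - 1] is an orthonormal basis (list of d vectors in R^D) of L_i;
    alphas[0] is the outlier weight alpha_0. Returns (points, labels),
    where label 0 marks an outlier and label i >= 1 marks a point on L_i.
    """
    points, labels = [], []
    for _ in range(n):
        i = rng.choices(range(len(alphas)), weights=alphas)[0]
        if i == 0:
            x = sample_unit_ball(D, rng)  # outlier drawn from mu_0
        else:
            basis = bases[i - 1]
            coeffs = sample_unit_ball(len(basis), rng)  # uniform in the d-ball
            x = [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(D)]
        points.append(x)
        labels.append(i)
    return points, labels
```

Because the basis is orthonormal, a uniform point in the $d$-dimensional coefficient ball maps to a uniform point on $L_i \cap B(0, 1)$.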

To simplify this introduction we discuss here the clean case with underlying uniform
mixture measure $\mu$. We first explain the $\ell_p$ recovery of the best $\ell_0$ subspace, i.e.,
the subspace containing the largest number of points of $\mathcal{X}$ (so that its complement
minimizes the $\ell_0$ distance). When addressing this problem, we will always assume the
following condition (using notation of Definition 1.1):

$$\alpha_1 > \sum_{i=2}^{K} \alpha_i. \qquad (1)$$

This condition implies that $L_1$ is the best $\ell_0$ subspace for $\mathcal{X}$ with overwhelming
probability. By saying "with overwhelming probability", or in short "w.o.p.", we mean that
the underlying probability is at least $1 - C e^{-N/C}$, where $C$ is a constant independent
of $N$, but possibly depending on other parameters of the underlying uniform mixture
measure.

The $\ell_p$ recovery of the best $\ell_0$ subspace for $\mathcal{X}$ minimizes the quantity

$$e_{\ell_p}(\mathcal{X}, L) = \sum_{\mathbf{x} \in \mathcal{X}} \mathrm{dist}(\mathbf{x}, L)^p, \qquad (2)$$

where $\mathrm{dist}(\mathbf{x}, L)$ denotes the Euclidean distance between a data point $\mathbf{x}$ and the
subspace $L$. This is not a convex optimization, since it takes place on the Grassmannian.
We refer to the minimizer of (2) as the best $\ell_p$ $d$-subspace.
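
The energy in (2) is straightforward to evaluate for a fixed candidate subspace. The sketch below assumes the subspace is represented by an orthonormal basis; the names `dist_to_subspace` and `lp_energy` are hypothetical. It only evaluates the energy and does not attempt the non-convex minimization over the Grassmannian.

```python
import math

def dist_to_subspace(x, basis):
    """Euclidean distance from x to span(basis); basis is assumed orthonormal."""
    proj_sq = sum(sum(xi * bi for xi, bi in zip(x, b)) ** 2 for b in basis)
    # Pythagoras: ||x||^2 = ||P_L x||^2 + dist(x, L)^2 (clamped against round-off)
    return math.sqrt(max(sum(xi * xi for xi in x) - proj_sq, 0.0))

def lp_energy(points, basis, p):
    """Sum over the data points of dist(x, L)^p, as in (2)."""
    return sum(dist_to_subspace(x, basis) ** p for x in points)
```

For instance, with the line spanned by $(1, 0)$ in the plane, the points $(1, 0)$, $(0, 2)$ and $(3, 4)$ have distances 0, 2 and 4, so the energy with $p = 1$ is 6.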

Our main result for exact $\ell_p$ recovery w.o.p. of the best $\ell_0$ subspace from multiple
clean subspaces in point clouds is formulated in the following theorem. We remark that
in this theorem as well as throughout the rest of the paper the distance between two
subspaces is the geodesic one, which is specified later in (8).

Theorem 1.1. If $\mu$ is a uniform mixture measure on $\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i\}_{i=1}^{K} \subset G(D, d)$
and mixture coefficients $\{\alpha_i\}_{i=0}^{K}$ satisfying (1), $\mathcal{X}$ is a data set of $N$ points
independently sampled from $\mu$ and $0 < p \le 1$, then the probability that $L_1$ is a best $\ell_p$
subspace is at least $1 - C \exp(-N/C)$, where $C$ is a constant depending on $D$, $d$, $K$,
$p$, $\alpha_0$, $\alpha_1$ and $\min_{2 \le i \le K} \mathrm{dist}(L_1, L_i)$.

Next, we address the second problem of simultaneous recovery of all $K$ subspaces
via $\ell_p$ minimization of the following energy, defined for the data set $\mathcal{X}$ and any
subspaces $L_1, \ldots, L_K$:

$$e_{\ell_p}(\mathcal{X}, L_1, \ldots, L_K) = \sum_{\mathbf{x} \in \mathcal{X}} \min_{1 \le j \le K} (\mathrm{dist}(\mathbf{x}, L_j))^p. \qquad (3)$$

The following theorem states that when $0 < p \le 1$ the minimization of this energy
exactly recovers w.o.p. the underlying clean $K$ subspaces within a point cloud.

Theorem 1.2. If $\mu$ is a uniform mixture measure on $\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i\}_{i=1}^{K} \subset G(D, d)$
and mixture coefficients $\{\alpha_i\}_{i=0}^{K}$, $\mathcal{X}$ is a data set independently sampled from
$\mu$ and $0 < p \le 1$, then there exists a positive constant $\nu_0 = \nu_0(d, K, p)$, such that
whenever

$$\alpha_0 < \frac{\nu_0}{2} \cdot \min_{i=1,\ldots,K} \alpha_i \cdot \min\Bigl(2, \min_{1 \le i < j \le K} \mathrm{dist}(L_i, L_j)\Bigr), \qquad (4)$$

then the set $\{L_1, L_2, \ldots, L_K\}$ minimizes the energy (3) among all $d$-subspaces in $\mathbb{R}^D$
with overwhelming probability.
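
The multi-subspace energy (3) assigns each point to its nearest subspace before summing. A minimal sketch follows, again assuming orthonormal bases; the names `dist_to_subspace` and `assign_and_energy` are hypothetical, and the code evaluates (3) for given subspaces rather than minimizing it.

```python
import math

def dist_to_subspace(x, basis):
    """Euclidean distance from x to span(basis); basis is assumed orthonormal."""
    proj_sq = sum(sum(xi * bi for xi, bi in zip(x, b)) ** 2 for b in basis)
    return math.sqrt(max(sum(xi * xi for xi in x) - proj_sq, 0.0))

def assign_and_energy(points, bases, p):
    """Return (labels, energy): nearest-subspace index per point and the value of (3)."""
    labels, energy = [], 0.0
    for x in points:
        dists = [dist_to_subspace(x, basis) for basis in bases]
        j = min(range(len(bases)), key=dists.__getitem__)  # closest subspace
        labels.append(j)
        energy += dists[j] ** p
    return labels, energy
```

The returned labels are the segmentation induced by the candidate subspaces, which is the quantity of interest in hybrid linear modeling.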

For the noisy setting, we assume a uniform mixture measure with noise level $\epsilon$ and
show later in Section 5 that the above two $\ell_p$ minimization procedures with $0 < p \le 1$
nearly recover w.o.p. (up to an error of order $\epsilon$) the best $\ell_0$ subspace and the $K$ underlying
subspaces. In fact, we also extend there these results to $p > 1$ and $K = 1$. That is, we
will show that a single underlying subspace in a point cloud can be nearly recovered
(with error proportional to the noise level) by $\ell_p$ minimization for any $p > 0$. On the
other hand, we later establish in Section 6 a phase transition phenomenon for multiple
underlying subspaces. That is, if $K > 1$ and $p > 1$, then the $\ell_p$ recovery as well as
near-recovery of the best $\ell_0$ subspace will not work well. We will also provide there some
indication why we expect a similar negative result for $\ell_p$ recovery of all $K$ underlying
subspaces, when $p > 1$ and $K > 1$. The uniformity assumption (along subspaces, of
outliers and of noise) will be relaxed in Section 8.

The theory developed here is a quantitative study of robustness of $\ell_p$ subspace
approximations in point clouds. We are not aware of other informative quantifications of
robustness. Indeed, the notion of a breakdown point of robust statistics [24, 22, 38, 35]
does not directly apply to best $\ell_p$ subspaces, since they are contained in a compact
space, i.e., the Grassmannian, and thus the discussion of arbitrarily far elements is
irrelevant. On the other hand, measuring the influence function, which is also common
in robust statistics [24, 22, 38, 35], is not informative for our probabilistic model (as
opposed to sufficiently far outliers).

This quantitative study of robustness has direct implications for both single subspace
modeling and hybrid linear modeling in point clouds. We will use Theorem 1.1, its
extension to noise (Theorem 5.1) and the breakdown of both theorems when $p > 1$
(Theorem 6.1) in order to analyze the effectiveness of $\ell_p$-based loss functions in a
RANSAC framework (in the spirit of [43, 44]). We will also use Theorem 1.2 and its
extension to noise (Theorem 5.2) to partially justify two different robust algorithms for
HLM [50, 51].
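
As a rough illustration of what an $\ell_p$-based loss in a RANSAC-type framework might look like, the sketch below treats the simplest case of lines through the origin in the plane ($d = 1$, $D = 2$). This is not the procedure of [43, 44] or of this paper: the function names, the candidate-generation rule (one random data point spans one candidate line) and the fixed number of trials are all assumptions made for illustration.

```python
import math
import random

def dist_to_line(x, u):
    """Distance from x in R^2 to the line through the origin with unit direction u."""
    return abs(x[0] * u[1] - x[1] * u[0])

def ransac_lp_line(points, p, trials, rng):
    """Among random candidate lines, keep the one with the smallest l_p energy (2)."""
    best_u, best_energy = None, float("inf")
    for _ in range(trials):
        x = rng.choice(points)  # a single nonzero point spans a candidate 1-subspace
        norm = math.hypot(x[0], x[1])
        if norm == 0.0:
            continue
        u = (x[0] / norm, x[1] / norm)
        energy = sum(dist_to_line(y, u) ** p for y in points)
        if energy < best_energy:
            best_u, best_energy = u, energy
    return best_u, best_energy
```

The only change relative to a consensus-counting RANSAC is the score: candidates are ranked by the sum of $p$-th powers of distances rather than by the number of points within a threshold.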
1.3. More Results and Structure of the Paper

Additional theory is developed throughout the paper in the following order. In Section 2
we describe basic notation and review frequently used concepts. In Section 3 we
specify general algebraic conditions for a best $\ell_0$ subspace to be a local minimum of
the energy (2) for various $0 < p < \infty$. We also demonstrate natural instances, distinct
from point clouds, where the best $\ell_0$ subspace is neither a local $\ell_p$ subspace (even for
$p = 1$) nor a global one (even for $0 < p < 1$). Section 4 involves data sampled from a
mixture composed of a uniform distribution along a single $d$-subspace and a uniform
background of outliers. It studies when the best $\ell_0$ subspace for such data is either a
local or global minimum of the energy (2) for $0 < p \le 1$ (for example, if one samples
$N_0$ outliers and $N_1$ inliers and if both $N_0 = o(N_1^2)$ and $p = 1$ or both $N_0 = O(N_1)$
and $0 < p < 1$, then the best $\ell_0$ subspace is a local $\ell_p$ minimum). Theorems 1.1
and 1.2 above extend part of this study to data sampled from several $d$-subspaces
with an outlier component. Section 5 extends the latter two theorems to near-recovery
in the noisy setting, whereas Section 6 discusses failures of $\ell_p$ recovery or near-recovery
when $p > 1$ and $K > 1$. Section 7 uses some of the theory developed here to partially
justify two effective algorithms for robust HLM as well as an approach for single
subspace recovery. Section 8 discusses some immediate extensions of the results of this
paper as well as open directions. We separately include all mathematical details
verifying the main theory in Section 9, while leaving some auxiliary verifications to the
appendix.

2. Preliminaries 

2.1. Main Setting and Basic Notation 

The noiseless setting of the paper is obtained by independently sampling a data set $\mathcal{X}$
of $N$ points from a uniform mixture measure $\mu$ (see Definition 1.1). We often partition
$\mathcal{X}$ into the subsets $\{\mathcal{X}_i\}_{i=0}^{K}$ with $\{N_i\}_{i=0}^{K}$ points sampled according to the measures
$\{\mu_i\}_{i=0}^{K}$ used in the definition of $\mu$. We remark that in Theorem 4.1 we will directly
sample from $\mu_0$ and $\mu_1$, instead of the uniform mixture measure $\mu$.

We will inquire whether the best $\ell_0$ subspace for $\mathcal{X}$ is a local $\ell_p$ subspace or a global
$\ell_p$ subspace w.o.p. By local and global $\ell_p$ subspaces we mean local or global minima
of the energy expressed in (2). We use the terms global $\ell_p$ subspace and
best $\ell_p$ subspace interchangeably.

We sometimes apply the energies (2) and (3) to a single point $\mathbf{x}$, while using the
notation: $e_{\ell_p}(\mathbf{x}, L) = e_{\ell_p}(\{\mathbf{x}\}, L)$ and
$e_{\ell_p}(\mathbf{x}, L_1, L_2, \ldots, L_K) = e_{\ell_p}(\{\mathbf{x}\}, L_1, L_2, \ldots, L_K)$.

We denote possibly large scalars by upper-case plain letters (e.g., $N$, $C$) and scalars
with relatively small values by lower-case Greek letters (e.g., $\alpha$, $\epsilon$); vectors by boldface
lower-case letters (e.g., $\mathbf{u}$, $\mathbf{v}$); matrices by boldface upper-case letters (e.g., $\mathbf{A}$); sets by
upper-case Roman (e.g., $L$) or calligraphic letters (e.g., $\mathcal{X}$) and measures by lower-case
Greek letters (e.g., $\mu$, $\theta_D$ and $\gamma_{D,d}$). We often distinguish between different constants
within the same proof, but may use the same notation for different constants of different
proofs.

In addition to the shorthand w.o.p., we use the following ones: w.p. for "with
probability", w.r.t. for "with respect to" and WLOG for "without loss of generality".

We denote the Euclidean norm of $\mathbf{x} \in \mathbb{R}^D$ by $\|\mathbf{x}\|$ and the ball centered at $\mathbf{x} \in \mathbb{R}^D$
with radius $r$ by $B(\mathbf{x}, r)$. For any $c > 0$, we let $c \cdot B(\mathbf{x}, r) := B(\mathbf{x}, c \cdot r)$.

The $(i, j)$-element of a matrix $\mathbf{A}$ is denoted by $A_{ij}$. The transpose of $\mathbf{A}$ is denoted by $\mathbf{A}^T$
and that of a vector $\mathbf{v}$ by $\mathbf{v}^T$. The Frobenius and nuclear norms of $\mathbf{A}$ are denoted by
$\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_*$ respectively (the former is the square root of the sum of squares
of the singular values of $\mathbf{A}$ and the latter is the sum of its singular values). The $n \times n$
identity matrix is written as $\mathbf{I}_n$. We designate the orthogonal group of $n \times n$ matrices by
$O(n)$ and the semigroup of $n \times n$ nonnegative scalar matrices by $S_+(n)$. We denote the
subset of $S_+(n)$ with Frobenius norm 1 by $NS_+(n)$. If $m > n$ we let
$O(m, n) = \{\mathbf{X} \in \mathbb{R}^{m \times n} : \mathbf{X}^T \mathbf{X} = \mathbf{I}_n\}$, whereas if $n > m$,
$O(m, n) = \{\mathbf{X} \in \mathbb{R}^{m \times n} : \mathbf{X} \mathbf{X}^T = \mathbf{I}_m\}$.

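If $L$ is a subspace of $\mathbb{R}^D$, we denote by $L^{\perp}$ its orthogonal complement. We designate
the projections from $\mathbb{R}^D$ onto $L$ and $L^{\perp}$ by $P_L$ and $P_L^{\perp}$ respectively. If $\mathbf{x} \in \mathbb{R}^D$, we use
$\mathrm{dist}(\mathbf{x}, L)$ to denote the orthogonal distance from $\mathbf{x}$ to $L$. We define the scaled outlying
"correlation" matrix $\mathbf{B}_{L,\mathcal{X}}$ of a data set $\mathcal{X}$ and a $d$-subspace $L$ as follows:

$$\mathbf{B}_{L,\mathcal{X}} = \sum_{\mathbf{x} \in \mathcal{X} \setminus L} P_L(\mathbf{x}) P_L^{\perp}(\mathbf{x})^T / \mathrm{dist}(\mathbf{x}, L). \qquad (5)$$

We will also use the following operator:

$$\mathbf{D}_{L,\mathcal{X},p} = \sum_{\mathbf{x} \in \mathcal{X} \setminus L} P_L(\mathbf{x}) P_L^{\perp}(\mathbf{x})^T \, \mathrm{dist}(\mathbf{x}, L)^{(p-2)}. \qquad (6)$$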



2.2. Principal Angles, Principal Vectors and Related Notation 

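We denote the principal angles [21] between two $d$-subspaces $F$ and $G$ by
$\pi/2 \ge \theta_1 \ge \theta_2 \ge \cdots \ge \theta_d \ge 0$, where we order them decreasingly, unlike common notation. We
denote by $k = k(F, G)$ the largest number such that $\theta_k \ne 0$, so that
$\theta_1 \ge \cdots \ge \theta_k > \theta_{k+1} = \cdots = \theta_d = 0$. We refer to this number as the interaction dimension and reserve
the index $k$ for denoting it (the subspaces $F$ and $G$ will be clear from the context). We
recall that the principal vectors $\{\mathbf{v}_i\}_{i=1}^{d}$ and $\{\mathbf{v}'_i\}_{i=1}^{d}$ of $F$ and $G$ respectively are two
orthonormal bases for $F$ and $G$ satisfying

$$\langle \mathbf{v}_i, \mathbf{v}'_i \rangle = \cos(\theta_i), \quad \text{for } i = 1, \ldots, d,$$

and

$$\mathbf{v}_i \perp \mathbf{v}'_j, \quad \text{for all } 1 \le i \ne j \le k.$$

We define the complementary orthogonal system $\{\mathbf{u}_i\}_{i=1}^{d}$ for $G$ with respect to $F$
by the formula:

$$\mathbf{v}'_i = \cos(\theta_i)\mathbf{v}_i + \sin(\theta_i)\mathbf{u}_i, \quad i = 1, 2, \ldots, k,$$
$$\mathbf{u}_i = \mathbf{v}_i, \quad i = k + 1, \ldots, d.$$

We note that

$$\mathbf{u}_i \perp \mathbf{v}_j \quad \text{for all } 1 \le i, j \le k.$$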
We thus orthogonally decomposed $F + G$ into the 2-dimensional subspaces $\mathrm{Sp}(\mathbf{v}_i, \mathbf{u}_i)$,
$i = 1, \ldots, k$, of mutually orthogonal systems and the residual subspace $F \cap G$. The
interaction between $F$ and $G$ can then be described only within these subspaces via
the principal angles. This idea is also motivated by purely geometric intuition in [47,
Section 2].



2.3. Grassmannian, Invariant Metric and Geodesics

The Grassmannian space $G(D, d)$ is the set of all $d$-subspaces of $\mathbb{R}^D$ with a manifold
structure. Throughout the paper we implicitly use principal vectors to represent
$G(D, d)$ by $O(d) \times O(d, D - d) \times S_+(d)$. Indeed, we fix a $d$-subspace $L_1 \in G(D, d)$
and for any $L \in G(D, d)$ we form the principal vectors $\{\mathbf{v}_i\}_{i=1}^{d}$ and $\{\mathbf{v}'_i\}_{i=1}^{d}$ for $L_1$
and $L$ respectively; the projection of $\{\mathbf{v}_i\}_{i=1}^{d}$ onto $L_1$ corresponds to an element of
$O(d)$; the projection of $\{\mathbf{v}'_i\}_{i=1}^{d}$ (or the complementary vectors $\{\mathbf{u}_i\}_{i=1}^{d}$ of $L$ w.r.t. $L_1$)
onto $L_1^{\perp}$ gives rise to an element of $O(d, D - d)$; the principal angles in $S_+(d)$ then
relate elements projected onto $L_1^{\perp}$ and $L_1$. Our representation is rather different from the
common representation in numerical computation [18, Table 2.1], which uses either of
the quotient spaces: $O(D, d)/O(d)$ or $O(D)/(O(d) \times O(D - d))$.

We will measure distances between $F$ and $G$ in $G(D, d)$ by the following metric:

$$\mathrm{dist}(F, G) = \sqrt{\sum_{i=1}^{d} \theta_i^2}. \qquad (8)$$

This distance was suggested in [47] as an invariant metric, since it measures the geodesic
distance in $G(D, d)$ between the corresponding subspaces [47], as one can see from (9)
below.

It follows from [47, Theorem 9] that if the largest principal angle between $F$ and
$G$ is less than $\pi/2$, then there is a unique geodesic line between them. Following [18,
Theorem 2.3], we can parametrize this line from $F$ to $G$ by the following function
$L : [0, 1] \to G(D, d)$, which is expressed in terms of the principal angles $\{\theta_i\}_{i=1}^{d}$ of $F$
and $G$, the principal vectors $\{\mathbf{v}_i\}_{i=1}^{d}$ of $F$ and the complementary orthogonal system
$\{\mathbf{u}_i\}_{i=1}^{d}$ of $G$ with respect to $F$:

$$L(t) = \mathrm{Sp}(\{\cos(t\theta_i)\mathbf{v}_i + \sin(t\theta_i)\mathbf{u}_i\}_{i=1}^{d}). \qquad (9)$$
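
For intuition, the geodesic parametrization can be sketched in the simplest case $d = 1$, where each subspace is a line spanned by a unit vector and there is a single principal angle. The function name `geodesic_between_lines` is hypothetical and both inputs are assumed to have unit norm; the general case would require principal vectors computed, e.g., from a singular value decomposition, which is not attempted here.

```python
import math

def geodesic_between_lines(v, w, t):
    """Unit direction of L(t) for d = 1, moving from F = Sp(v) at t = 0 to G = Sp(w) at t = 1."""
    dot = sum(a * b for a, b in zip(v, w))
    if dot < 0:  # pick the representative of G with nonnegative inner product
        w = [-b for b in w]
        dot = -dot
    theta = math.acos(min(dot, 1.0))  # principal angle in [0, pi/2]
    if theta == 0.0:
        return list(v)  # F and G coincide
    # complementary direction u, defined by w = cos(theta) v + sin(theta) u
    u = [(b - math.cos(theta) * a) / math.sin(theta) for a, b in zip(v, w)]
    return [math.cos(t * theta) * a + math.sin(t * theta) * c for a, c in zip(v, u)]
```

In this one-dimensional case the geodesic distance between $F$ and $G$ is simply the principal angle `theta`.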

If $L \in G(D, d)$, we denote by $B(L, r)$ the closed ball in $G(D, d)$ around $L$ with
radius $r$. We also denote by $B_E(B(L, r_1), r_2)$ the Euclidean ball around $B(L, r_1)$, i.e.,
the set of all points in $\mathbb{R}^D$ whose distance from the set $\bigcup_{L' \in B(L, r_1)} L'$ is at most $r_2$.

We will use the natural probability measure on the Grassmannian, commonly denoted
by $\gamma_{D,d}$ [36]. We recall that for any fixed $F \in G(D, d)$ and any $A \subset G(D, d)$:

$$\gamma_{D,d}(A) = \theta_D(\{\mathbf{B} \in O(D) : \mathbf{B}F \in A\}),$$

where $\theta_D$ is the Haar measure on $O(D)$, so that for any $\mathbf{x} \in S^{D-1}$ (where $S^{D-1}$
is the $(D - 1)$-dimensional unit sphere with uniform probability measure $\sigma^{D-1}$) and
$E \subset S^{D-1}$:

$$\theta_D(\{\mathbf{B} \in O(D) : \mathbf{B}\mathbf{x} \in E\}) = \sigma^{D-1}(E).$$

3. Counterexamples and Conditions for Robustness of $\ell_p$ Subspaces

3.1. Counterexamples for Robustness of Best $\ell_p$ Subspaces

We show here that there are many natural situations, though different from our underlying
model of uniform outliers, where best $\ell_p$ $d$-subspaces are not robust to outliers for
all $0 < p < \infty$. More precisely, we show how a single outlier can completely change
the underlying subspace.

A typical example includes $N_1$ points sampled independently and uniformly from
a $d$-dimensional ball in $\mathbb{R}^D$ centered at the origin with radius $\epsilon$ and an additional
outlier located on a unit vector orthogonal to that $d$-subspace. By choosing $\epsilon$ sufficiently
small, e.g., $\epsilon \le (1/N_1)^{1/p}$, the best $\ell_p$ subspace passes through the single outlier and
is thus orthogonal to the initial $d$-subspace for all $p > 0$.
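
The role of a threshold of order $(1/N_1)^{1/p}$ can be seen from a short energy comparison. The following is a sketch under stated assumptions rather than the paper's argument: $L_1$ denotes the initial $d$-subspace, $\mathbf{x}_1, \ldots, \mathbf{x}_{N_1} \in L_1$ are the inliers with $\|\mathbf{x}_i\| \le \epsilon$, $\mathbf{y}$ is the unit outlier orthogonal to $L_1$, and $L'$ is any $d$-subspace containing $\mathbf{y}$. It shows only that $L_1$ has larger energy than $L'$, not where the minimizer lies.

```latex
% Energy of the initial subspace: only the outlier contributes.
e_{\ell_p}(\mathcal{X}, L_1) = \mathrm{dist}(\mathbf{y}, L_1)^p = \|\mathbf{y}\|^p = 1.

% Energy of a subspace through the outlier: only inliers contribute,
% and dist(x, L') <= ||x|| because L' contains the origin.
e_{\ell_p}(\mathcal{X}, L') = \sum_{i=1}^{N_1} \mathrm{dist}(\mathbf{x}_i, L')^p
  \le \sum_{i=1}^{N_1} \|\mathbf{x}_i\|^p \le N_1 \epsilon^p.

% Hence e_{\ell_p}(\mathcal{X}, L') < e_{\ell_p}(\mathcal{X}, L_1)
% whenever N_1 \epsilon^p < 1, i.e. \epsilon < (1/N_1)^{1/p}.
```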

If $p = 1$, then the best $\ell_0$ $d$-subspace in this example is still a local $\ell_1$ subspace.
Nevertheless, if the outlier is located instead on a unit vector having elevation angle
with the original $d$-subspace less than $\pi/2$, then $\epsilon$ can be chosen so that the best $\ell_0$
subspace is neither a local nor global $\ell_1$ subspace. However, if $0 < p < 1$, then the best
$\ell_0$ subspace is still a local $\ell_p$ subspace in both examples as well as almost any other
scenario (see e.g., Proposition 3.1 below).

Similarly, it is not hard to produce an example of data points on the unit sphere of
$\mathbb{R}^D$ where the best $\ell_0$ subspace is still not a best $\ell_1$ subspace. This is in contrast to
the case of sparse representation of signals, where normalization of the column vectors
of a matrix representing an undercomplete linear system of equations ensures that the
solution minimizing the $\ell_1$ norm is also the sparsest solution as long as it is sufficiently
sparse [17, Theorem 2]. For simplicity we give a counterexample for $d = 2$ by letting
$N_1$ data points be uniformly sampled along an arc of length $\epsilon$ of a great circle of
the sphere $S^2 \subset \mathbb{R}^3$. We then place an outlier on another great circle, which passes
through the center of the $\epsilon$-arc and has a small angle with it. If $\epsilon$ is sufficiently small
and the outlier is placed furthest from the intersection of the two great circles, then the best $\ell_0$
subspace is not a local $\ell_1$ subspace and consequently not a global one. We remark that
in this example both assumptions of this paper requiring uniformity of outliers (or more
generally symmetry around the origin; see Section 8) and symmetry around the origin
of inliers (see again Section 8) are not satisfied.

3.2. Combinatorial Conditions for $\ell_0$ Subspaces being Local $\ell_p$ Subspaces

We formulate conditions for the best $\ell_0$ subspace to be a local $\ell_p$ subspace, while
distinguishing between three cases: $p = 1$, $0 < p < 1$ and $p > 1$. We prove these results
in Section 9.2. The most interesting condition is when $p = 1$, which we describe as
follows. It uses notation introduced in Section 2, in particular, the scaled outlying
"correlation" matrix $\mathbf{B}_{L,\mathcal{X}}$ of (5).

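Theorem 3.1. If $L_1 \in G(D, d)$, $\mathcal{X}_1 = \{\mathbf{x}_i\}_{i=1}^{N_1} \subset L_1$, $\mathcal{X}_0 = \{\mathbf{y}_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus L_1$ and
$\mathcal{X} = \mathcal{X}_0 \cup \mathcal{X}_1$, then a sufficient condition for $L_1$ to be a local minimum of $e_{\ell_1}(\mathcal{X}, L)$
among all $d$-subspaces $L \in G(D, d)$ is that for any $\mathbf{V} \in O(d)$ and $\mathbf{C} \in S_+(d)$:

$$\sum_{i=1}^{N_1} \|\mathbf{C}\mathbf{V} P_{L_1}(\mathbf{x}_i)\| > \|\mathbf{C}\mathbf{V} \mathbf{B}_{L_1,\mathcal{X}}\|_*. \qquad (10)$$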

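The next proposition shows that for $p < 1$ the best $\ell_0$ subspace is almost always a
local $\ell_p$ subspace.

Proposition 3.1. If $L_1 \in G(D, d)$, $\mathcal{X}_1 = \{\mathbf{x}_i\}_{i=1}^{N_1} \subset L_1$, $\mathcal{X}_0 = \{\mathbf{y}_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus L_1$,
$\mathrm{Sp}(\{\mathbf{x}_i\}_{i=1}^{N_1}) = L_1$ and $p < 1$, then $L_1$ is a local minimum of $e_{\ell_p}(\mathcal{X}, L)$ among all
$L \in G(D, d)$.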

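At last, for $p > 1$ we establish a necessary condition for the best $\ell_0$ subspace to be a
local $\ell_p$ subspace. This condition is rather degenerate and often cannot be satisfied.

Proposition 3.2. If $L_1 \in G(D, d)$, $\mathcal{X}_1 = \{\mathbf{x}_i\}_{i=1}^{N_1} \subset L_1$, $\mathcal{X}_0 = \{\mathbf{y}_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus L_1$ and
$p > 1$, then a necessary condition for $L_1$ to be a local minimum of $e_{\ell_p}(\mathcal{X}, L)$ among
all $L \in G(D, d)$ is

$$\sum_{i=1}^{N_0} P_{L_1}(\mathbf{y}_i) P_{L_1}^{\perp}(\mathbf{y}_i)^T \, \mathrm{dist}(\mathbf{y}_i, L_1)^{p-2} = \mathbf{0}. \qquad (11)$$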

The above results manifest a phase transition phenomenon. Indeed, the best $\ell_0$
subspace is almost always a local $\ell_p$ subspace for $p < 1$, whereas for $p > 1$ this is often
not the case (except for an underlying measure which is symmetric in the complement
of $L_1$; for example, in the case of an underlying uniform mixture with $K = 1$, the best
$\ell_0$ subspace is asymptotically a best $\ell_p$ subspace for all $p > 0$). The combinatorial
condition implying when it is a local $\ell_1$ subspace is more complicated and we exemplify
its application throughout the paper.

4. Best $\ell_0$ Subspaces as Local or Global $\ell_p$ Subspaces for Uniform Sampling

We assume here the probabilistic setting of a uniform mixture measure with a single
underlying subspace $L_1$, i.e., $K = 1$. Clearly, $L_1$ is the best $\ell_0$ subspace for the sampled
data w.o.p. For any $p > 0$, we ask whether $L_1$ is also a local or even global $\ell_p$ subspace
w.o.p. We prove the corresponding results described below in Section 9.3.

We first claim that for $p = 1$ the best $\ell_0$ subspace is a local $\ell_p$ subspace w.o.p. as
long as the fraction of inliers is sufficiently large.

Theorem 4.1. If $L_1 \in G(D, d)$ and $\mathcal{X}$ is a data set in $\mathbb{R}^D$ of $N_0 + N_1$ points, where
$N_0$ of them are uniformly and independently sampled from the unit ball $B(0, 1)$ in $\mathbb{R}^D$
and $N_1$ of them are independently and uniformly sampled from $B(0, 1) \cap L_1$, then $L_1$
is a local $\ell_1$ subspace of $\mathcal{X}$ w.p. at least

i - 2d 2 cxp f-^r) - 2dDc *P ("^) ' where *> + W 1 e< 2/{2d + 3) • 
In particular, if $N_0 = o(N_1^2)$, then $L_1$ is a local $\ell_1$ subspace of $\mathcal{X}$ w.p. at least



For $0 < p < 1$, Proposition 3.1 implies that if $N_1 = O(1)$ then $L_1$ is a local $\ell_p$
subspace w.o.p. On the other hand we claim next that if $p > 1$ and $N_0 = O(1)$, then
the subspace $L_1$ is a local $\ell_p$ subspace w.p. 0.

Proposition 4.1. Consider $L_1 \in G(D, d)$, $\mu_0$ a uniform distribution in $B(0, 1) \subset \mathbb{R}^D$,
$\mu_1$ a uniform distribution on $L_1 \cap B(0, 1)$, $\mu = \alpha_0 \mu_0 + \alpha_1 \mu_1$, where $0 < \alpha_0 < 1$,
$\alpha_1 = 1 - \alpha_0$ and $\mathcal{X}$ is a data set sampled independently from $\mu$. If $p > 1$, then the
probability that $L_1$ is a local $\ell_p$ subspace of $\mathcal{X}$ is 0.

The proof of this proposition is rather immediate. Indeed, the outliers, denoted
by $\{\mathbf{y}_i\}_{i=1}^{N_0}$, have uniform distribution $\mu_0$, which has a bounded and nonzero
probability density function for vectors in the unit $D$-dimensional ball. Therefore for any
$L' \in G(D, d)$ the joint probability density function of
$P_{L'}(\mathbf{y}_i) P_{L'}^{\perp}(\mathbf{y}_i)^T \, \mathrm{dist}(\mathbf{y}_i, L')^{p-2}$
is also bounded and nonzero on the corresponding range and thus (11) has
probability 0.

Another question is whether the best $\ell_0$ subspace is also the global $\ell_p$ subspace.
Proposition 4.1 and Theorem 1.1 already answered this question in our setting. Indeed,
if $p > 1$, then by Proposition 4.1 the best $\ell_0$ subspace is a global $\ell_p$ subspace with
probability 0; whereas if $0 < p \le 1$, then Theorem 1.1 with $K = 1$ implies that for
$N_0 = O(N_1)$ the best $\ell_0$ subspace is also the best $\ell_p$ subspace w.o.p.

We formulate this special case of Theorem 1.1 below and prove it separately. We
believe that it is easier to digest the whole proof of Theorem 1.1 by first following it
for this special case and later generalizing it.

Theorem 4.2. If $L_1 \in G(D, d)$, $\mu_0$ is a uniform distribution in $B(0, 1) \subset \mathbb{R}^D$, $\mu_1$ is a
uniform distribution on $L_1 \cap B(0, 1)$, $\mu = \alpha_0 \mu_0 + \alpha_1 \mu_1$, where $\alpha_0$, $\alpha_1$ are nonnegative
numbers summing to 1 and $\mathcal{X}$ is a data set independently sampled from $\mu$, then $L_1$ is a
best $\ell_p$ subspace for $\mathcal{X}$ w.o.p. for any $0 < p \le 1$.

At last, we remark that the phase transition phenomenon demonstrated above at
$p = 1$ is rather artificial in the current setting. Indeed, this phase transition is based
on the fact that (11) holds w.p. 0 for $p > 1$ and any finite sample; however, the LHS
of (11) divided by $N$ is 0 w.p. 1 as $N$ approaches infinity. Moreover, when $p > 1$ the
positive distance between the best $l_0$ subspace and the best $l_p$ subspace approaches 0 as
$N$ approaches infinity. We will show in Theorem 5.1 that this formal phase transition
also breaks down with noise. Nevertheless, as we show in Theorem 6.1, there is a clear
phase transition for a uniform mixture model with $K > 1$. This is rather intuitive since
the underlying measure of the latter case is not symmetric on the complement of $L_1$,
unlike the case where $K = 1$.

5. Extension of the Theory to Noisy Setting 

We present here extensions of previous results, in particular Theorems 1.1 and 1.2, to
the setting of independent samples from a uniform mixture measure of noise level $\epsilon > 0$.




(12) 
We prove those extensions in Section 9.5.

In this noisy setting, Theorem 1.1 is still valid up to a recovery error proportional
to the noise level $\epsilon$. In fact, if $K = 1$, then such a near-recovery generalizes to all
$0 < p < \infty$.

Theorem 5.1. If $\epsilon > 0$, $\mu_\epsilon$ is a uniform mixture measure of noise level $\epsilon$ on
$\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i\}_{i=1}^K \subset \mathbb{R}^D$ and mixture coefficients
$\{\alpha_i\}_{i=0}^K$, $X$ is a data set of $N$ points sampled independently from $\mu_\epsilon$ and
$0 < p \le 1$, then the best $l_p$ subspace for $X$ is in the ball $B(L_1, f)$, where

2 p a^e 



(«1 - E*=2 a i) P 

w.p. at least $1 - C\exp(-N/C)$, where
$C = C(\epsilon, p, d, D, \alpha_0, \alpha_1, \min_{2 \le i \le K}\operatorname{dist}(L_1, L_i))$.

If $K = 1$, then the above statement extends for all $0 < p < \infty$ with

3+p 3 / r> \ 7 j_ 
/ = /(e,X,d,p,a ,ai) = 2 * — e». 



Remark 5.1. If0<p< 1 and 



• A' 

2— d 



e> ^g^ (13) 



or p > 1, /\ = 1 and 



f/ien / > 2^^, which implies that B(Li, /) = G(D, d) (since all principle angles are 
at most tt/2). It thus makes sense to restrict the level of noise to be at least lower than 
the right hand sides of ( 1 3) or ( 14). 

Theorem 1.2 also extends to uniform mixture measures with restricted noise level. 
This restriction on e is expressed in the theorem below, while using the following con- 
stant: 

7o= 1+ * (15) 

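Theorem 5.2. Let $\epsilon > 0$, let $\mu_\epsilon$ be a uniform mixture measure of noise level $\epsilon$
on $\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i\}_{i=1}^K \subset \mathbb{R}^D$ and mixture coefficients
$\{\alpha_i\}_{i=0}^K$, and let $X$ be a data set of $N$ points sampled independently from
$\mu_\epsilon$. If $0 < p \le 1$ and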



e<3 p t min aj min dist p (L ? , Lj)/2 P — a ) , (16) 
then the minimizer of (3) in $G(D,d)^K$ has a distance smaller than

f = /(e, K, d,p, {a.i}f =1 ) = 3p (t min dj - a ) e (17) 



l<j<K 

from one of the permutations of $(L_1, L_2, \ldots, L_K)$.
6. The Phase Transition at p = 1

Theorems 1.1, 1.2, 5.1 and 5.2 established the recovery and near recovery of the best $l_0$
subspace as well as all underlying subspaces by $l_p$ minimization whenever $0 < p \le 1$.
We also showed that for a single subspace, i.e., $K = 1$, near recovery extends to $p > 1$
(see Theorem 5.1) and exact recovery asymptotically extends to $p > 1$, but is never
realized (see Section 4). Here we establish the impossibility of such $l_p$ recoveries when
$p > 1$ and $K > 1$ and thus demonstrate a phase transition at $p = 1$ when $K > 1$. We
prove all statements in Section 9.6.

We first claim that the best $l_0$ subspace cannot be effectively recovered or nearly
recovered by $l_p$ minimization when $p > 1$ and $K > 1$. That is, we establish a phase
transition of the $l_p$ recovery of the best $l_0$ subspace at $p = 1$.

Theorem 6.1. Assume that $\{L_i\}_{i=1}^K$ are $K$ $d$-subspaces in $\mathbb{R}^D$, which are
independently distributed according to $\gamma_{D,d}$. For each $\epsilon \ge 0$ and a random sample
of $\{L_i\}_{i=1}^K$, let $\mu_\epsilon$ be a uniform mixture measure of noise level $\epsilon$ (or without
noise when $\epsilon = 0$) on $\mathbb{R}^D$ w.r.t. $\{L_i\}_{i=1}^K \subset \mathbb{R}^D$ and let $X$ be a data
set of $N$ points sampled independently from $\mu_\epsilon$. If $K > 1$ and $p > 1$, then for almost
every $\{L_i\}_{i=1}^K$ (w.r.t. $\gamma_{D,d}^K$), there exist positive constants $\delta_0$ and $\kappa_0$,
independent of $N$, such that for any $0 \le \epsilon < \delta_0$ the best $l_p$ subspace of $X$ is
not in the ball $B(L_1, \kappa_0)$ with overwhelming probability.

Remark 6.1. The above constants $\delta_0$ and $\kappa_0$ depend on other parameters of the
underlying uniform mixture model, in particular the underlying subspaces $\{L_i\}_{i=1}^K$. For
example, in the case of $p > 2$ one can estimate from below both $\kappa_0$ and $\delta_0$ by the
following number:

IIEt2"^(D Ll ,x, P )||| 

dD2P+ 5 

where Dl 1iX , p is defined in (6) and for any i = 1, . . . 1 K, pi_ e is obtained by pro- 
jecting pi e onto the subspace Lj. (that is, for any set E C B(0, 1) n L^: /ij e (E) = 
p ite (P^(E))). 

Next, we formulate the impossibility of recovering all underlying $d$-subspaces by $l_p$
minimization for $p > 1$ and thus demonstrate a phase transition in this case.

Theorem 6.2. Assume that $\{L_i\}_{i=1}^K$ are $K$ $d$-subspaces in $\mathbb{R}^D$, which are
independently distributed according to $\gamma_{D,d}$. For each $\epsilon \ge 0$ and a random sample
of $\{L_i\}_{i=1}^K$, let $\mu_\epsilon$ be a uniform mixture measure of noise level $\epsilon$ (or without
noise when $\epsilon = 0$) on $\mathbb{R}^D$ w.r.t. $\{L_i\}_{i=1}^K \subset \mathbb{R}^D$ and let $X$ be a data
set of $N$ points sampled independently from $\mu_\epsilon$. If $p > 1$ and $K > 1$, then for almost
every $\{L_i\}_{i=1}^K$ (w.r.t. $\gamma_{D,d}^K$) there exist positive constants $\delta_0$ and $\kappa_0$,
independent of $N$, such that for any $\epsilon < \delta_0$ the minimizer of (3),
$(\hat L_1, \hat L_2, \ldots, \hat L_K)$, satisfies w.o.p.:

$$\operatorname{dist}((\hat L_1, \hat L_2, \ldots, \hat L_K), (L_1, L_2, \ldots, L_K)) > \kappa_0.$$

7. Implications of the Theory for Subspace Modeling 

We discuss the implications of the theory described above for robust HLM and even
for the simpler case of robust modeling by a single subspace. Since we study here only
particular distributions, we cannot fully explain the general behavior of the algorithms
mentioned below. Nevertheless, we can still provide some quantitative explanations of
their performance and clarify situations where it is necessary to use the values
$0 < p \le 1$ for efficient $l_p$ minimizations.

A very common algorithm for recovering a $d$-subspace in a point cloud is the RANSAC
algorithm [19]. Its simplest version repeatedly applies the following two steps: 1. randomly
select a set of $d$ independent vectors; 2. count the number of data points within a
strip of width $\epsilon$ around the $d$-subspace spanned by those $d$ vectors (both $\epsilon$ and the
number of iterations of these two steps are parameters set by the user). The final output of
this algorithm is the $d$-subspace maximizing the quantity computed in step 2. Almost
all other variants of RANSAC assess the best $d$-subspace by the same quantity, which
depends on the unknown parameter $\epsilon$.
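The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `ransac_subspace`, its arguments, and the reading of "within a strip of width $\epsilon$" as "Euclidean distance at most `eps` from the subspace" are ours and are not taken from [19].

```python
import numpy as np

def ransac_subspace(X, d, eps, n_iter=200, seed=0):
    """Hypothetical sketch of the simplest RANSAC variant for a linear d-subspace.

    X is an N x D array of data points. Returns a D x d orthonormal basis of the
    best subspace found and the number of points within distance eps of it.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    best_basis, best_count = None, -1
    for _ in range(n_iter):
        # Step 1: randomly select d data points and orthonormalize them.
        idx = rng.choice(N, size=d, replace=False)
        Q, R = np.linalg.qr(X[idx].T)            # Q has shape D x d
        if np.min(np.abs(np.diag(R))) < 1e-12:   # skip linearly dependent draws
            continue
        # Step 2: count the points whose distance to span(Q) is at most eps.
        residual = X - (X @ Q) @ Q.T
        count = int(np.sum(np.linalg.norm(residual, axis=1) <= eps))
        if count > best_count:
            best_basis, best_count = Q, count
    return best_basis, best_count
```

Both `eps` and `n_iter` are user-set parameters, exactly as in the description; the output depends on `eps`, which is the point made in the last sentence of the paragraph.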

Torr and Zisserman [43, 44] have suggested a RANSAC-type strategy which minimizes
a variant of the $l_2$ distance from a subspace. This variant uses the square function
until a fixed threshold and a constant function for larger values.
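A minimal sketch of such a loss, assuming the constant value is chosen so that the function is continuous at the threshold (the name `truncated_squared_loss` and the parameter `threshold` are hypothetical and not taken from [43, 44]):

```python
import numpy as np

def truncated_squared_loss(dist, threshold):
    """Squared distance up to `threshold`, constant (threshold**2) beyond it."""
    dist = np.asarray(dist, dtype=float)
    return np.minimum(dist ** 2, threshold ** 2)
```

For small distances this coincides with the $l_2$ loss, which is the property that matters in the discussion of $K > 1$ that follows.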

Theorems 1.1, 5.1 and 6.1 provide some insights on the effectiveness of recovering
the best $l_0$ $d$-subspace (or best $l_0$ strip of width $\epsilon$) in a uniform mixture setting
by minimizing $l_p$ distances in the spirit of [43, 44]. In particular, they imply that if
$K > 1$ then only $l_p$ distances with $0 < p \le 1$ should be considered in the latter
setting. Even distances that coincide with the $l_2$ distance for sufficiently small values,
such as [43, 44] or Huber's loss function [24], will not recover the underlying subspaces,
as the proofs of those theorems show. On the other hand, for a single underlying
subspace in point clouds with possibly additive noise, $l_p$ recovery should succeed in
theory for any $0 < p < \infty$, though the bounding constants worsen as $p$ increases. The
idea of [43, 44] of making the loss function constant for large values is expected to help
with significantly far and nonuniform outliers that are not covered by our model. Such
outliers are discussed, e.g., in Section 3.1.

For the recovery of multiple subspaces, RANSAC has been repeatedly applied in [49],
while removing the points around the subspace found at the current iteration and
providing the reduced data for the next one. Numerical results in [51] show that this
strategy is both accurate and fast for some artificial data when setting the RANSAC
parameter $\epsilon$ to be the model's noise level. However, in practice, the noise level is unknown.
Da Silva and Costeiranuno [13] have suggested an alternative numerical optimization
over the Grassmannian to iteratively estimate subspaces, while avoiding the RANSAC
procedure. However, their method seems to be sensitive to local minima and there is no
obvious interpretation for their objective function. On the other hand, for the particular
setting of uniform mixture measures with noise, Theorem 5.1 provides a clear
interpretation for the $l_p$ minimization and also guarantees its stability. However, in
practice, we may apply such a setting only when recovering the best $l_0$ subspace among all
underlying $K$ subspaces.

Rigorous application of Theorem 5.1 for iterative recovery of the rest of the underlying
subspaces requires the extension of this theorem to more general scenarios; such
an extension depends on the precise way of removing the part of the data around a
subspace (see some relevant though not sufficient extensions in Section 8).

On the contrary, Theorems 1.2 and 5.2 explain the simultaneous minimization of
subspaces via the energy (3). Zhang et al. [50] suggested a stochastic gradient descent
approach for approximating this minimization problem (only $p = 1$ is discussed there,
but their method applies to any $0 < p < \infty$). They have demonstrated robustness
to outliers for artificial data sets. More recently, [51] described multiscale geometric
strategies for forming candidate $d$-subspaces. They then select the best $K$ $d$-subspaces
by minimizing the energy (3) among all such candidates (or many of them). Their
choice of candidates is justified in [51, Theorem 1], i.e., they show that among the
large set of candidate subspaces there are $K$ subspaces closely approximating the true
underlying subspaces. On the other hand, their use of $l_1$ minimization to find the best
approximating candidates to the true subspaces is justified by Theorems 1.2 and 5.2 for
particular sampling rules.
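The candidate-selection step can be illustrated as follows. The sketch assumes that the energy (3) is the sum over data points of the $p$-th power of the distance to the nearest of the $K$ subspaces; equation (3) itself is stated outside this section, so this form is an assumption, and the function names are hypothetical. The exhaustive search is only meant for very small candidate sets and is not the procedure of [51].

```python
from itertools import combinations

import numpy as np

def dist_to_subspace(X, Q):
    """Distance from each row of X to span(Q); Q is D x d with orthonormal columns."""
    return np.linalg.norm(X - (X @ Q) @ Q.T, axis=1)

def lp_energy(X, bases, p):
    """Assumed form of (3): sum of min_i dist(x, L_i)^p over the data points."""
    dists = np.stack([dist_to_subspace(X, Q) for Q in bases], axis=1)  # N x K
    return float(np.sum(np.min(dists, axis=1) ** p))

def select_best_k(X, candidates, K, p):
    """Indices of the K candidates with the smallest energy (brute force)."""
    return min(combinations(range(len(candidates)), K),
               key=lambda idx: lp_energy(X, [candidates[i] for i in idx], p))
```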

8. Discussion 

We studied the effectiveness of $l_p$ minimization for recovering both the best $l_0$ subspace
and all underlying $K$ subspaces with overwhelming probability when independently
sampling from a uniform mixture measure. A probabilistic setting was necessary since
we also described some typical cases where best $l_p$ subspaces are different from best $l_0$
subspaces for all $0 < p < \infty$. We also showed how to generalize this study in order
to nearly recover the subspaces in the case of additive uniform noise. Furthermore, we
demonstrated a phase transition phenomenon around $p = 1$ for $l_p$ recovery of the best
$l_0$ subspace when $K > 1$. Our analysis has provided some guarantees for the robustness
to point clouds of some recent HLM algorithms as well as single subspace recovery.

There are many possibilities to extend this work and we would like to discuss some 
of these directions here. 

More general distributions. It will be interesting to extend our probabilistic results
to more general distributions, i.e., distributions that are not purely uniform. We discuss
here some of these generalizations, which are apparent from the proofs of the theory.
We first note that our results extend with weaker bounds to approximately uniform
distributions, i.e., distributions whose pdf's are bounded away from 0 and $\infty$ on the
corresponding regions. By weaker bounds, we mean for example that the lower bound
on $\alpha_0$ in Theorem 4.2 (i.e., $\alpha_0 > 0$) and more generally the ratio between the LHS and
RHS of (1) in Theorem 1.1 need to increase (depending on the upper and lower bounds
of the underlying pdf's).

Moreover, it is clear that the uniformity along subspaces can be generalized to
uniformity (or approximate uniformity) along spheres around the origin. More precisely,
we may assume that all $\{\mu_i\}_{i=1}^K$ have the same distribution (up to rotation) with a
radially symmetric pdf (or approximately so). For example, one can use the same spherical
Gaussian distribution along subspaces. Similarly, the assumption of uniform outliers
can be relaxed by assuming that the pdf of $\mu_0$ is spherically symmetric around the
origin. When exploring when the best $l_0$ subspace is a local $l_p$ subspace, e.g., as in
Theorem 4.1, it is sufficient to ask that the pdf of $\mu_0$ is symmetric with respect to
$L_1$. More precisely, such a symmetry requires that $E_{\mu_0}(D_{L_1,x,p}) = 0$ for all $p > 0$.
For example, this pdf may obtain the same values on all points in the unit ball with the
same distance to $L_1$. Alternatively, it may obtain the same values on all points within
the unit ball on the boundary of cones in $\mathbb{R}^D$ centered on $L_1$ at the origin (such cones
are defined, e.g., in [31, Section 2.1]).

By a further weakening of Theorem 1.1, it is possible to replace the uniformity
(or approximate uniformity) of outliers in the unit ball by uniformity (or approximate
uniformity) of the projection of outliers onto the best $l_0$ subspace $L_1$. This will require
a sufficiently large lower bound on $\alpha_1$, in particular, larger than 0.5. This lower bound
has to depend on the maximal distance of outliers from $L_1$.

The noisy setting can be extended with weaker bounds to densities of the form
$f(x) = f(P_{L_1}(x), P_{L_1}^{\perp}(x)) = g(P_{L_1}^{\perp}(x))\,h(P_{L_1}(x))$, where $h$ is uniform (or
radially symmetric) and $g$ decays sufficiently fast, e.g., $g$ is the pdf of a normal
distribution. In fact, if $g$ is chosen from a parametric family with parameter $\epsilon$, then it is
sufficient to assume that the $L_1$ norm of the random variable with density $g$ approaches
zero as $\epsilon$ approaches zero.

The case of affine subspaces. Our analysis was restricted to linear subspaces,
though it can be formally extended to affine subspaces intersecting a fixed ball fully
contained in $B(0,1)$, e.g., the ball $B(0,1/2)$. Indeed, we can consider the affine
Grassmannian [36, 29], which distinguishes between subspaces according to both their
offsets with respect to the origin (i.e., distances to closest linear subspaces of the same
dimension) and their orientations (based on principal angles of the shifted linear
subspaces). The assumption above on the affine subspaces (i.e., their offsets are less than
1/2) restricts them to be in a compact subspace of the affine Grassmannian, as necessary
for our analysis. Nevertheless, it is not obvious whether the metric on the affine
Grassmannian is relevant for our applications, since it mixes two different quantities
of different units (i.e., offset values and orientations) so that one can arbitrarily weigh
their contributions. We remark that the common strategy of using homogeneous
coordinates, which transform $d$-dimensional affine subspaces in $\mathbb{R}^D$ to $(d+1)$-dimensional
linear subspaces in $\mathbb{R}^{D+1}$, is not useful to us since it distorts the structure of both noise
and outliers.

A related problem of interest to us is to explain why different variants of both the
$K$-subspaces and iterative RANSAC (for HLM) do not perform as well with affine subspaces
as they do with linear ones. It is clear though that the analysis of the $K$-subspaces
algorithm is different in the two cases. Indeed, the required analysis needs to deal with
sets of points closer to a given subspace among all underlying subspaces, namely the
regions $\{Y_j\}_{j=1}^K$ of (96). For linear subspaces the boundaries of such regions are
polyhedral surfaces, whereas for affine subspaces they are piecewise quadratic.

Further performance guarantees for $l_p$-based HLM Algorithms.

The MKF algorithm [50] attempts to minimize the energy (3) for $p = 1$. The theory
described here advocates such minimization. However, in practice, the MKF applies a
stochastic gradient descent for approximating the minimum value. We are interested in
direct study of convergence as well as robustness to outliers of this iterative
approximation.

Another iterative method based on $l_p$ subspace minimization is the $K$-subspaces
algorithm [25, 4, 45, 23]. It minimizes a function of both the $K$ $d$-subspaces and the
$K$ clusters. Consequently, it can be more sensitive to initializations of the clusters. In
particular, it seems hard to generalize Theorems 1.1 and 5.1 to provide performance
guarantees for the $l_p$-based $K$-subspaces algorithm with underlying linear subspaces.
We are also curious about even partial analysis for this algorithm in the case of mixed
dimensions.

9. Verification of Theory 

We describe here the complete proofs of the various theorems and propositions of this 
paper. 

9.1. Auxiliary Lemmata 

We formulate several technical lemmata, which will be used throughout the proofs of
the following sections. Their proofs appear in Appendices A.1-A.4.

Lemma 9.1. Suppose that Li, Li, La, • • • , hx G G(D, d), p > and p\ is a uniform 
distribution in B(0, 1) n Li. 7fmiiii<_,<A' dist(Li, hj) > e, then 

E, t (ej^x.Li.La,... ,£,*)) > ■ 

Lemma 9.2. For any $x \in B(0,1)$ and $L_1, L_2 \in G(D,d)$:

$$|\operatorname{dist}(x, L_1) - \operatorname{dist}(x, L_2)| \le \|x\| \operatorname{dist}(L_1, L_2).$$
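A quick numerical sanity check of Lemma 9.2 is sketched below. It assumes that the distance between subspaces is the square root of the sum of squared principal angles; the definition of this distance is given outside this section, so this is an assumption of the sketch, and all function names are hypothetical.

```python
import numpy as np

def random_subspace(D, d, rng):
    """Orthonormal basis (D x d) of a random d-subspace of R^D."""
    return np.linalg.qr(rng.standard_normal((D, d)))[0]

def point_dist(x, Q):
    """Euclidean distance from the point x to span(Q)."""
    return np.linalg.norm(x - Q @ (Q.T @ x))

def subspace_dist(Q1, Q2):
    """Assumed metric: l_2 norm of the vector of principal angles."""
    s = np.clip(np.linalg.svd(Q1.T @ Q2, compute_uv=False), -1.0, 1.0)
    return np.linalg.norm(np.arccos(s))

def lemma_9_2_gap(D=6, d=2, trials=1000, seed=0):
    """Largest observed value of LHS - RHS over random trials (should be <= 0)."""
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(trials):
        Q1, Q2 = random_subspace(D, d, rng), random_subspace(D, d, rng)
        x = rng.standard_normal(D)
        x *= rng.uniform() ** (1.0 / D) / np.linalg.norm(x)  # uniform in the unit ball
        lhs = abs(point_dist(x, Q1) - point_dist(x, Q2))
        rhs = np.linalg.norm(x) * subspace_dist(Q1, Q2)
        worst = max(worst, lhs - rhs)
    return worst
```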

Lemma 9.3. If $L_1, L_2 \in G(D,d)$, $x_1$, $x_2$ are uniformly distributed random variables
in $B(0,1) \cap L_1$, $B(0,1) \cap L_2$ respectively and $p \le 1$, then for any $L \in G(D,d)$:

$$E(\operatorname{dist}(x_1, L)^p) + E(\operatorname{dist}(x_2, L)^p) \ge E(\operatorname{dist}(x_1, L_i)^p) + E(\operatorname{dist}(x_2, L_i)^p) \quad \text{for } i = 1, 2. \tag{18}$$

The next lemma uses the constant $\tau_0$ of (15) and the following notation w.r.t. the
fixed $d$-subspaces $L_1, L_2, \ldots, L_K, \hat L_1, \hat L_2, \ldots, \hat L_K \in G(D,d)$:

$$I(i) = \operatorname{argmin}_{1 \le j \le K} \operatorname{dist}(L_i, \hat L_j) \quad \forall\, 1 \le i \le K \tag{19}$$

and

$$d_0 = \min_{(i_1, i_2, \ldots, i_K) \in P_K} \operatorname{dist}((L_{i_1}, L_{i_2}, \ldots, L_{i_K}), (\hat L_1, \hat L_2, \ldots, \hat L_K)). \tag{20}$$

Lemma 9.4. Suppose that $L_1, L_2, \ldots, L_K, \hat L_1, \hat L_2, \ldots, \hat L_K \in G(D,d)$ and
$0 < p \le 1$. If $(I(1), \ldots, I(K))$ is a permutation of $(1, \ldots, K)$, then

$$E_\mu e_{l_p}(x, \hat L_1, \hat L_2, \ldots, \hat L_K) - E_\mu e_{l_p}(x, L_1, L_2, \ldots, L_K) \ge \Big(\tau_0 \min_{1 \le j \le K} \alpha_j - \alpha_0\Big)\, d_0^p.$$

On the other hand, if $(I(1), \ldots, I(K))$ is not a permutation of $(1, \ldots, K)$, then

$$E_\mu e_{l_p}(x, \hat L_1, \hat L_2, \ldots, \hat L_K) - E_\mu e_{l_p}(x, L_1, L_2, \ldots, L_K) \ge \tau_0 \min_{1 \le j \le K} \alpha_j \Big(\min_{1 \le i < j \le K} \operatorname{dist}(L_i, L_j)/2\Big)^p - \alpha_0.$$
9.2. Theory of Section 3

9.2.1. Proof of Theorem 3.1

In order to show that $L_1$ is a local minimum of $e_{l_1}(X, L)$ among all $d$-subspaces in
$G(D,d)$, we arbitrarily fix a $d$-subspace $L \in B(L_1, 1)$ and show that the derivative of
the $l_1$ energy, when restricted to the geodesic line from $L_1$ to $L$, is positive at $L_1$.

The restriction of $L$ to $B(L_1, 1)$ implies that $\theta_1 < 1$ and thus by [47, Theorem 9]
this geodesic line (connecting $L_1$ and $L$) is unique. We parametrize it by the function
$L: [0,1] \to G(D,d)$ of (9), where here $\{\theta_i\}_{i=1}^d$ are the principal angles between $L_1$
and $L$, $\{v_i\}_{i=1}^d$ are the principal vectors of $L_1$ and $\{u_i\}_{i=1}^k$ are the complementary
orthogonal system for $L$ with respect to $L_1$. Using this parametrization we need to prove
that the function $e_{l_1}(X, L(t)): [0,1] \to \mathbb{R}$ has a positive derivative at $t = 0$.

We follow by simplifying the expression for the function $e_{l_1}(X, L(t))$ and its
derivative according to $t$. We denote the projection from $\mathbb{R}^D$ onto $\mathrm{Sp}(v_j, u_j)$, where
$1 \le j \le d$, by $P_j$ and the projection from $\mathbb{R}^D$ onto $(L_1 + L)^{\perp}$ by $P^{\perp}$ and use this
notation to express the following components of the function $e_{l_1}(X, L(t))$ for
$i = 1, \ldots, N_0$:

$$\operatorname{dist}(y_i, L(t)) = \sqrt{\sum_{j=1}^{d} \operatorname{dist}^2(P_j(y_i), L(t)) + \operatorname{dist}^2(P^{\perp}(y_i), L(t))}. \tag{21}$$

For $1 \le j \le d$, we let $\phi_j \in [0, 2\pi]$ denote the angle such that
$P_j(y_i) = \|P_j(y_i)\|(\cos(\phi_j) v_j + \sin(\phi_j) u_j)$ and consequently express each term of
the sum in (21) as follows:

$$\operatorname{dist}^2(P_j(y_i), L(t)) = \|P_j(y_i)\|^2 \sin^2(\phi_j - t\theta_j), \quad j = 1, \ldots, d. \tag{22}$$

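Applying (22) in (21) and differentiating, we obtain the following expression for the
derivative of $\operatorname{dist}(y_i, L(t))$ for all $1 \le i \le N_0$:

$$\frac{d}{dt}\big(\operatorname{dist}(y_i, L(t))\big) = -\frac{\sum_{j=1}^{d} \theta_j \|P_j(y_i)\|^2 \sin(\phi_j - t\theta_j)\cos(\phi_j - t\theta_j)}{\operatorname{dist}(y_i, L(t))}$$
$$= -\frac{\sum_{j=1}^{d} \theta_j \big((\cos(t\theta_j) v_j + \sin(t\theta_j) u_j) \cdot y_i\big)\big((-\sin(t\theta_j) v_j + \cos(t\theta_j) u_j) \cdot y_i\big)}{\operatorname{dist}(y_i, L(t))}. \tag{23}$$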



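At $t = 0$ it becomes

$$\frac{d}{dt}\big(\operatorname{dist}(y_i, L(t))\big)\Big|_{t=0} = -\frac{\sum_{j=1}^{d} \theta_j (v_j \cdot y_i)(u_j \cdot y_i)}{\operatorname{dist}(y_i, L(0))} = -\frac{\sum_{j=1}^{k} \theta_j (v_j \cdot y_i)(u_j \cdot y_i)}{\operatorname{dist}(y_i, L(0))}, \tag{24}$$

where the interaction dimension $k = k(L_1, L)$ has been introduced in Section 2.2.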
We form the following matrices: $C = \operatorname{diag}(\theta_1, \theta_2, \ldots, \theta_d)$, $V \in O(d, D)$ with
$j$-th row $v_j$ and $U \in O(k, D)$ with $j$-th row $u_j$. We then reformulate (24) using these
matrices as follows:

$$\frac{d}{dt}\big(\operatorname{dist}(y_i, L(t))\big)\Big|_{t=0} = -\frac{\operatorname{tr}_k(C V y_i y_i^T U^T)}{\operatorname{dist}(y_i, L_1)}, \tag{25}$$



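where $\operatorname{tr}_k$ denotes the trace of the first $k$ rows of the corresponding $d \times k$ matrix,
whose last $d - k$ rows are zeros. Similarly, for all $x_i \in L_1$, $i = 1, 2, \ldots, N_1$,

$$\operatorname{dist}(x_i, L(t)) = \sqrt{\sum_{j=1}^{d} (v_j \cdot x_i)^2 \sin^2(t\theta_j)} \tag{26}$$

and

$$\frac{d}{dt}\big(\operatorname{dist}(x_i, L(t))\big) = \frac{\sum_{j=1}^{d} \theta_j |v_j \cdot x_i|^2 \sin(t\theta_j)\cos(t\theta_j)}{\operatorname{dist}(x_i, L(t))}.$$

At $t = 0$, this derivative becomes

$$\frac{d}{dt}\big(\operatorname{dist}(x_i, L(t))\big)\Big|_{t=0} = \sqrt{\sum_{j=1}^{d} \theta_j^2 |v_j \cdot x_i|^2} = \|C V x_i\|. \tag{27}$$

Combining (25) and (27) and using

$$A := \sum_{i=1}^{N_0} y_i y_i^T / \operatorname{dist}(y_i, L_1)$$

we obtain the following expression for the derivative of the $l_1$ energy of (2):

$$\frac{d}{dt}\big(e_{l_1}(X, L(t))\big)\Big|_{t=0} = \sum_{i=1}^{N_1} \|C V x_i\| - \operatorname{tr}_k(C V A U^T). \tag{28}$$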
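The inlier formulas above can be checked numerically. The sketch below assumes that the geodesic of (9) is spanned by the vectors $\cos(t\theta_j) v_j + \sin(t\theta_j) u_j$, which is the form in which it enters (23); it also takes $k = d$ for simplicity. It compares the distance of an inlier to $L(t)$ with the closed-form square root expression, and a finite-difference slope at $t = 0$ with $\|CVx_i\|$. All variable names are hypothetical.

```python
import numpy as np

def geodesic_basis(V, U, theta, t):
    """Rows cos(t*theta_j) v_j + sin(t*theta_j) u_j: an orthonormal basis of L(t)."""
    c, s = np.cos(t * theta)[:, None], np.sin(t * theta)[:, None]
    return c * V + s * U

def dist_to_Lt(x, V, U, theta, t):
    """Euclidean distance from x to the subspace L(t)."""
    B = geodesic_basis(V, U, theta, t)           # d x D with orthonormal rows
    return np.linalg.norm(x - B.T @ (B @ x))

rng = np.random.default_rng(0)
D, d = 7, 3
Q = np.linalg.qr(rng.standard_normal((D, 2 * d)))[0]
V, U = Q[:, :d].T, Q[:, d:].T                    # rows v_j span L_1; rows u_j are orthogonal to L_1
theta = np.array([0.9, 0.5, 0.2])                # principal angles (illustrative values)
x = V.T @ rng.uniform(-0.5, 0.5, size=d)         # an inlier lying in L_1

t = 0.3
closed_form = np.sqrt(np.sum((V @ x) ** 2 * np.sin(t * theta) ** 2))
numeric = dist_to_Lt(x, V, U, theta, t)

h = 1e-6                                         # finite-difference step at t = 0
slope_at_0 = (dist_to_Lt(x, V, U, theta, h) - dist_to_Lt(x, V, U, theta, 0.0)) / h
C = np.diag(theta)
predicted_slope = np.linalg.norm(C @ V @ x)      # ||C V x||
```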



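Since $V$ is a projection onto $L_1$ and $U$ is a projection onto $L_1^{\perp}$, we may rewrite
this expression by the matrix $\tilde V \in O(d)$, whose $j$-th row is $P_{L_1}(v_j)^T$, and the matrix
$\tilde U \in O(k, D - d)$, whose $j$-th row is $P_{L_1}^{\perp}(u_j)^T$:

$$\frac{d}{dt}\big(e_{l_1}(X, L(t))\big)\Big|_{t=0} = \sum_{i=1}^{N_1} \|C \tilde V P_{L_1}(x_i)\| - \operatorname{tr}_k(C \tilde V B_{L_1,X} \tilde U^T). \tag{29}$$

At last, we note that

$$\max_{\tilde U}\big(\operatorname{tr}_k(C \tilde V B_{L_1,X} \tilde U^T)\big) = \|C \tilde V B_{L_1,X}\|_*. \tag{30}$$

Indeed, denoting the SVD of $C \tilde V B_{L_1,X}$ by $U_0 \Sigma_0 V_0^T$ we have that

$$\operatorname{tr}_k(C \tilde V B_{L_1,X} \tilde U^T) = \operatorname{tr}_k(U_0 \Sigma_0 V_0^T \tilde U^T) = \operatorname{tr}_k(\Sigma_0 V_0^T \tilde U^T U_0) \le \operatorname{tr}(\operatorname{diag}(\Sigma_0)) = \|C \tilde V B_{L_1,X}\|_*,$$

and this equality can be achieved when $\tilde U^T$ consists of the first $k$ columns of $V_0 U_0^T$.
The theorem is thus concluded by combining (29) and (30). □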
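The maximization in (30) is an instance of the standard duality between the nuclear norm and matrices with orthonormal rows. The sketch below checks a simplified version of it (full rank, $k = d$, with a generic random matrix `M` standing in for the $d \times (D - d)$ matrix of (30)); it is an illustration of the SVD argument, not of the exact $\operatorname{tr}_k$ bookkeeping of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))                      # generic stand-in matrix
U0, s, V0t = np.linalg.svd(M, full_matrices=False)   # M = U0 @ diag(s) @ V0t
nuclear_norm = s.sum()

# The maximizer suggested by the SVD: a 3 x 5 matrix with orthonormal rows.
W_opt = U0 @ V0t
value_at_opt = np.trace(M @ W_opt.T)

# Random competitors with orthonormal rows never exceed the nuclear norm.
best_random = max(
    np.trace(M @ np.linalg.qr(rng.standard_normal((5, 3)))[0])
    for _ in range(200)
)
```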
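9.2.2. Simultaneous Proof for Both Propositions 3.1 and 3.2

For the $d$-subspace $L_1$ and an arbitrary $d$-subspace $L \in B(L_1, 1)$, we form the geodesic
line parametrization $L(t)$ and the corresponding matrices $C$, $V$, $U$, $\tilde V$ and $\tilde U$ as in the
proof of Theorem 3.1. Similarly to verifying (25) and (27) in the latter proof, we obtain
that

$$\frac{d}{dt}\big(\operatorname{dist}(y_i, L(t))^p\big)\Big|_{t=0} = -p \operatorname{dist}(y_i, L_1)^{p-2} \operatorname{tr}_k(C V y_i y_i^T U^T) \tag{31}$$

and

$$\frac{d}{dt}\big(\operatorname{dist}(x_i, L(t))^p\big)\Big|_{t=0} = p \operatorname{dist}(x_i, L_1)^{p-1} \|C V x_i\|. \tag{32}$$

Consequently

$$\frac{d}{dt}\big(e_{l_p}(X, L(t))\big)\Big|_{t=0} = p \sum_{i=1}^{N_1} \operatorname{dist}(x_i, L_1)^{p-1} \|C V x_i\| - p \sum_{i=1}^{N_0} \operatorname{dist}(y_i, L_1)^{p-2} \operatorname{tr}_k(C V y_i y_i^T U^T) \tag{33}$$
$$= p \sum_{i=1}^{N_1} \operatorname{dist}(x_i, L_1)^{p-1} \|C \tilde V P_{L_1}(x_i)\| - p \sum_{i=1}^{N_0} \operatorname{dist}(y_i, L_1)^{p-2} \operatorname{tr}_k(C \tilde V P_{L_1}(y_i) P_{L_1}^{\perp}(y_i)^T \tilde U^T).$$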

Assume first that p < 1. Then 
d 



d£M*.L(*))) 



= pt 1 -^ dist(x l; Lx^IICVPl, (xi)|| (34) 

t=0 i=l 



iV, 



■pt 1 -* £ dist(y 4 , Li)^ 2 tr fc (CVP Ll (y^ (y 4 ) T U T ) 

i=l 

= pg(limdist(x,,L(i))/i) ||CVP Ll ( Xl )|| =£||CVP Ll ( Xi )|r. 



U-s-0 

»=i ?=i 
It follows immediately from the definitions of $C$ and $V$ that

$$\|C V x_i\| \ge \theta_1 |v_1^T x_i|. \tag{35}$$

Now, the assumption $\mathrm{Sp}(\{x_i\}_{i=1}^{N_1}) = L_1$ implies that there exists $1 \le j \le N_1$ such
that $v_1^T x_j \ne 0$ and thus $\|C \tilde V P_{L_1}(x_j)\| = \|C V x_j\| > 0$. Therefore, (34) is positive,
$L_1$ is a local minimum of $e_{l_p}(X, L(t))$ and Proposition 3.1 is proved.

Next, assume that $p > 1$ and note that

$$p \sum_{i=1}^{N_1} \operatorname{dist}(x_i, L_1)^{p-1} \|C \tilde V P_{L_1} x_i\| = 0. \tag{36}$$
Since $L_1$ is a local minimum of $e_{l_p}(X, L)$, the derivative in (33) is nonnegative and
in view of (36), the subtracted term in (33) is thus nonpositive. Now, for a subspace
$L \in G(D,d)$ such that $C = \tilde V = I_d$ we obtain that

N 

> maxp^dist(y l ,L 1 )f- 2 tr fc (P Ll (y 4 )P L J - 1 (y l ) T U T ) 

^ i—1 



= p 



N 



where the last equality follows from (30). Therefore, (11) holds and Proposition 3.2 is 
thus proved. □ 

9.3. Theory of Section 4 

9.3.1. Proof of Theorem 4. 1 

To find the probability that $L_1$ is a local $l_1$ subspace we will estimate the probabilities
of large LHS and small RHS of (10) for arbitrary $L \in B(L_1, 1)$. We use notation similar
to that in the proof of Theorem 3.1; in particular, we denote the $N_0$ outliers and
$N_1$ inliers by $\{y_i\}_{i=1}^{N_0}$ and $\{x_i\}_{i=1}^{N_1}$ respectively. Due to the homogeneity of (10) in
$C$, we will assume WLOG that $\|C\|_2 = 1$, i.e., $\theta_1 = 1$.

We start with estimating the probability that the RHS of (10) is small. Applying the
above assumption that $\|C\|_2 = 1$ we have that

$$\|C \tilde V B_{L_1,X}\|_F \le \|\tilde V B_{L_1,X}\|_F = \|B_{L_1,X}\|_F$$

and consequently

$$\Pr\left(\frac{\|C \tilde V B_{L_1,X}\|_*}{N_0} < \epsilon\right) \ge \Pr\left(\frac{\|C \tilde V B_{L_1,X}\|_F}{N_0} < \frac{\epsilon}{\sqrt{d}}\right) \ge \Pr\left(\frac{\|B_{L_1,X}\|_F}{N_0} < \frac{\epsilon}{\sqrt{d}}\right) \ge \Pr\left(\frac{\max_{p,l} |(B_{L_1,X})_{p,l}|}{N_0} < \frac{\epsilon}{d\sqrt{D}}\right).$$

We further estimate this probability by Hoeffding's inequality as follows: we view
the matrix $B_{L_1,X}$ as the sum of the random variables
$P_{L_1}(y_i) P_{L_1}^{\perp}(y_i)^T / \|P_{L_1}^{\perp}(y_i)\|$, $i = 1, \ldots, N_0$. The coordinates of both
$P_{L_1}(y_i)$ and $P_{L_1}^{\perp}(y_i)^T / \|P_{L_1}^{\perp}(y_i)\|$ take values in $[-1, 1]$ and their expectations
are 0. We can thus apply Hoeffding's inequality to the sum defining $B_{L_1,X}$ and
consequently obtain that

$$\Pr\left(\frac{\max_{p,l} |(B_{L_1,X})_{p,l}|}{N_0} < \frac{\epsilon}{d\sqrt{D}}\right) \ge 1 - 2dD \exp\left(-\frac{N_0 \epsilon^2}{2 d^2 D}\right). \tag{37}$$



Next, we estimate the probability that the LHS of (10) is sufficiently large. For this 
purpose we make the following observations. First of all, 

Ni Ni Ni 

J2 l|CVP Ll (x,)|| > \0ivJP^i)\ = E Ivf^iWI 

i—1 i—1 i—1 
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> 



\ 



£ |vf P Ll (x 4 )| 2 > min<x t £ P u (x,)P Ll (x 4 ) T 

i=l \i=l / 



Second of all, as proved in Appendix A.5:

$$E_{\mu_1}\big(P_{L_1}(x) P_{L_1}(x)^T\big) = \delta_* I_d, \quad \text{where } \delta_* = 1/(d + 2).$$
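The value $\delta_* = 1/(d+2)$ can be checked by simulation. The sketch below identifies $L_1$ with $\mathbb{R}^d$ (so that $P_{L_1}(x) = x$), samples uniformly from the unit ball by the usual direction-times-radius construction, and compares the empirical second-moment matrix with $\delta_* I_d$. It is a Monte Carlo illustration only and does not replace the proof in Appendix A.5; the helper name is hypothetical.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    """n points drawn uniformly from the unit ball of R^d."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform directions
    r = rng.uniform(size=(n, 1)) ** (1.0 / d)       # radii with density proportional to r^(d-1)
    return g * r

rng = np.random.default_rng(0)
d, n = 3, 200_000
x = sample_unit_ball(n, d, rng)
second_moment = x.T @ x / n                          # empirical E[x x^T]
delta_star = 1.0 / (d + 2)
```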
Last of all, as verified in Appendix A. 6: 



If max er t ^ P Ll (xj)P Ll (xj) T - 6*I d J < ??, 
then rnin cr t ^ P Ll (xO T J > 5* - ?/. 



(38) 



(39) 



(40) 



(41) 



We combine (38)-(40) and Hoeffding's inequality to obtain the following probabilistic 
estimate for the LHS of (10): 



(42) 



> Pr min a t 



Nx 



> 5* — r] 



> Pr I maxcr t I — d*I rf 1 < t] 



> Pr 



E 8 J I 1 l^ 1 (x i )PL 1 (x i ) 3 



A r i 



-<U„ 



< V 



> Pr I max 

p,l 



Nx S * Id 



p,i 



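From (37) and (42), (10) is valid with probability at least

$$1 - 2d^2 \exp\left(-\frac{N_1 \eta^2}{2 d^2}\right) - 2dD \exp\left(-\frac{N_0 \epsilon^2}{2 d^2 D}\right) \quad \text{for any } \epsilon, \eta \text{ such that } \eta + \frac{N_0}{N_1}\epsilon < \delta_*. \tag{43}$$

We can choose $\epsilon = N_1 \delta_* / (2 N_0) = N_1 / (2 N_0 (d + 2))$, $\eta = 1/(3(d + 2))$ and obtain
that if $N_0 = o(N_1^2)$ then (10) is valid with the probability specified in (12).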



9.3.2. Proof of Theorem 4.2 

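We first prove that there exists a constant $\gamma_1 > 0$ such that w.o.p. $L_1$ is the best $l_p$
subspace in $B(L_1, \gamma_1)$. We arbitrarily choose $L \in G(D,d)$ such that $\operatorname{dist}(L, L_1) = 1$
and parameterize a geodesic line from $L_1$ to $L$ by a function $L: [0,1] \to G(D,d)$,
where $L(0) = L_1$ and $L(1) = L$. We then observe that there exists $\gamma_1 > 0$ such
that the function $e_{l_p}(X, L(t)): [0,1] \to \mathbb{R}$ of (2) has a positive derivative w.o.p. at any
$t \in [0, \gamma_1]$, that is,

$$\frac{d}{dt^p}\left(\frac{\sum_{x \in X} \operatorname{dist}(x, L(t))^p}{N}\right) > 0 \quad \text{for all } t \in [0, \gamma_1] \text{ w.o.p.} \tag{44}$$

We will deduce (44) from the following two equations:

$$\frac{d}{dt^p}\left(\frac{\sum_{x \in X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=0} > \gamma_2 \quad \text{w.o.p. for some } \gamma_2 > 0 \tag{45}$$

and

$$\left|\frac{d}{dt^p}\left(\frac{\sum_{x \in X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=t_0} - \frac{d}{dt^p}\left(\frac{\sum_{x \in X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=0}\right| < \frac{\gamma_2}{2} \quad \forall\, t_0 \in [0, \gamma_1] \text{ w.o.p.} \tag{46}$$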



When $0 < p < 1$, equation (45) follows from (34) and Hoeffding's inequality. When
$p = 1$, equation (45) practically follows from the proof of Theorem 4.1 by arbitrarily
fixing $\epsilon$ and $\eta$ such that $\epsilon\alpha_0/\alpha_1 + \eta + \gamma_2/\alpha_1 < \delta_*$ and noting that when sampling
from the mixture measure specified in the current theorem (unlike Theorem 4.1) the
ratio of sampled outliers to inliers, $N_0/N_1$, goes w.o.p. to $\alpha_0/\alpha_1$. We also observe that
$\gamma_2 = \gamma_2(\alpha_0, \alpha_1, d)$.

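We first verify (46) for the sum of elements in $X_1 = X \cap L_1$. In view of (26) and
(34), for any $x \in X_1$ the single term in that sum (i.e., $\operatorname{dist}(x, L(t))^p$) has a bounded
second derivative with respect to $t$; hence, we can find constants $\gamma_1$ and $\gamma_2$ satisfying

$$\left|\frac{d}{dt^p}\left(\frac{\sum_{x \in X_1} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=t_0} - \frac{d}{dt^p}\left(\frac{\sum_{x \in X_1} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=0}\right| < \frac{\gamma_2}{6} \quad \forall\, t_0 \in [0, \gamma_1]. \tag{47}$$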



We derive a similar estimate by replacing the summation of $x \in X_1$ by the summation
of $x \in X \setminus X_1$. Using the constant $\gamma_3$, which we clarify later, we separate the latter
sum into two components: $\hat X := \{x \in X \setminus X_1 : \operatorname{dist}(x, L_1) < 2\gamma_3\}$ and
$(X \setminus X_1) \setminus \hat X$.

In order to deal with the first sum, we define

$$\gamma_4 = \mu(x : 0 < \operatorname{dist}(x, L_1) < 2\gamma_3),$$

where we note that we can choose $\gamma_3 = \gamma_3(d, \gamma_2) = \gamma_3(d, \alpha_0, \alpha_1)$ sufficiently
small such that $\gamma_4 = \gamma_4(d, \alpha_0, \alpha_1) < \gamma_2/24$. We use $\gamma_4$ to bound the ratio of sampled
points from $\hat X$ and $X$ as follows:



(48) 
Indeed, we note that $\#(\hat X) = \sum_{x \in X} I_{\hat X}(x)$, $E_\mu(I_{\hat X}(x)) = \mu(x : x \in \hat X) = \gamma_4$ and
$I_{\hat X}(x)$ takes values in $[0, 1]$; therefore, by applying Hoeffding's inequality to $I_{\hat X}(x)$,
where $x \in X$, we conclude (48).

Now, the derivative expressed in (23) takes values in $[-1, 1]$ for any $y_i \in \hat X$. Thus,
by combining this observation with (48) we obtain that for any $t_0 \in [0, \gamma_1]$:

$$\left|\frac{d}{dt^p}\left(\frac{\sum_{x \in \hat X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=t_0} - \frac{d}{dt^p}\left(\frac{\sum_{x \in \hat X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=0}\right| < 4\gamma_4 < \frac{\gamma_2}{6} \quad \text{w.o.p.} \tag{49}$$



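Differentiating (23) one more time, we obtain that for every $x \in (X \setminus X_1) \setminus \hat X$,
the second derivative of $\operatorname{dist}(x, L(t))$ is bounded by $C(d)/\gamma_3$. Thus we can choose
$\gamma_1 = \gamma_1(\gamma_2, \gamma_3, d) = \gamma_1(\alpha_0, \alpha_1, d, D)$ sufficiently small such that for any
$t_0 \in [0, \gamma_1]$:

$$\left|\frac{d}{dt^p}\left(\frac{\sum_{x \in (X \setminus X_1) \setminus \hat X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=t_0} - \frac{d}{dt^p}\left(\frac{\sum_{x \in (X \setminus X_1) \setminus \hat X} \operatorname{dist}(x, L(t))^p}{N}\right)\Bigg|_{t=0}\right| < \frac{\gamma_2}{6}. \tag{50}$$

Equation (46) and consequently (44) are thus verified by combining (47), (49) and (50).
That is, we showed that $L_1$ is the best $l_p$ subspace in $B(L_1, \gamma_1)$ for sufficiently small
$\gamma_1$.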

At last, we will show that for all $L \in G(D,d) \setminus B(L_1, \gamma_1)$ and any fixed $p \le 1$,
there exists some $\gamma_7 > 0$ such that

$$e_{l_p}(X, L) - e_{l_p}(X, L_1) > \gamma_7 N \quad \text{w.o.p.} \tag{51}$$

Indeed, we first conclude from Lemma 9.1 (applied with $K = 1$) that

E» K(x,L)) - Ep (e, p (x,Li)) > a (E^ (e ip (x,L)) - (e lp (x,U))) (52) 

+ <*i (%(x,L)) -E m (ei^M))) > 



V 

"i7i 



Setting 77 = ai7 P /(2 2+p (i^) and combining (52) with Hoeffding's inequality, we 
obtain (51). 

Now, (51) extends for a small neighborhood of $L$. That is, for any $L \in G(D,d)$
we can find a ball $B(L, t)$ for some $t > 0$ such that w.o.p. the subspace $L_1$ is a better
$l_p$ subspace than any of the subspaces in that ball. By covering the compact space
$G(D,d) \setminus B(L_1, \gamma_1)$ with a finite number of such balls we obtain that w.o.p. $L_1$ is the
best $l_p$ subspace in $G(D,d) \setminus B(L_1, \gamma_1)$. Combining this observation with the first part
of the proof, we conclude that w.o.p. $L_1$ is the best $l_p$ subspace in $G(D,d)$.

□ 



9.4. Theory Presented in the Introduction (Section 1) 

9.4.1. Proof of Theorem 1.1 

Several ideas of this proof have already appeared when verifying Theorem 4.2. We will 
thus maintain the same notation, in particular for denoting similar constants. 



G. Lerman and T. Zhang/Probabilistic Recovery of Multiple Subspaces 






As in proving the latter theorem, we will first prove the theorem locally. That is,
we will show that w.o.p. $L_1$ is a best $\ell_p$ subspace in the ball $B(L_1, \gamma_1)$, where $\gamma_1$ is a
sufficiently small constant.

In order to do so, we arbitrarily fix $\hat{L} \in G(D, d)$ such that $\mathrm{dist}(\hat{L}, L_1) = 1$ (so that
$C \in NS_+(d)$) and parameterize a geodesic line from $L_1$ to $\hat{L}$ by a function $L: [0,1] \to G(D,d)$,
where $L(0) = L_1$ and $L(1) = \hat{L}$. We will then estimate the probability
that for any such $\hat{L}$ the function $e_{\ell_p}(\mathcal{X}, L(t)): [0,1] \to \mathbb{R}$ has a positive derivative at
any $t \in (0, \gamma_1)$, that is



$$\frac{d}{dt^p}\left(\frac{\sum_{x \in \mathcal{X}} \mathrm{dist}(x, L(t))^p}{N}\right) > 0 \quad \text{for all } t \in (0, \gamma_1). \qquad (53)$$



First of all, we estimate the probability that the LHS of (53) is larger than some
constant $\gamma_2 > 0$ at $t = 0$. When $0 < p < 1$, it follows from (34) and Hoeffding's
inequality that this probability is overwhelming. When $p = 1$, it follows from (10) that
this probability is the same as the probability of the event

$$\frac{\sum_{x \in \mathcal{X}_1} \|CVP_{L_1}(x)\| - \|CVB_{L_1, \mathcal{X} \setminus \mathcal{X}_1}\|_*}{N} > \gamma_2 \quad \forall\, C \in NS_+(d) \text{ and } V \in O(d). \qquad (54)$$
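We notice that for all $C \in NS_+(d)$ and $V \in O(d)$:

$$\|CVB_{L_1, \mathcal{X} \setminus \mathcal{X}_1}\|_* = \Big\|CV \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} P_{L_1}(x) P_{L_1}^{\perp}(x)^T / \mathrm{dist}(x, L_1)\Big\|_* \le \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} \big\|CVP_{L_1}(x) P_{L_1}^{\perp}(x)^T / \|P_{L_1}^{\perp}(x)\|\big\|_* \le \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} \|CVP_{L_1}(x)\|.$$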

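Consequently, in order to estimate the probability of (54) it is sufficient to estimate the
probability that

$$\frac{\sum_{x \in \mathcal{X}_1} \|CVP_{L_1}(x)\| - \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} \|CVP_{L_1}(x)\|}{N} > \gamma_2 \quad \forall\, C \in NS_+(d) \text{ and } V \in O(d). \qquad (55)$$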

We arbitrarily fix $C_0 \in NS_+(d)$, $V_0 \in O(d)$ and verify (55) by Hoeffding's
inequality in the following way. We define the random variable
$J(x) = (2I(x \in \mathcal{X}_1) - 1)\,\|C_0 V_0 P_{L_1}(x)\|$ and note that

wn( « p ( T.^ Xl HCoVoP Ll (x)|| -Ex Wl ||CoV P Ll (x)|| ^ 
P Al (J(x)) = E^n I — I (56) 



: 1 1 C V P Ll (x) 1 1 - a E w | |C V P Ll (x) 1 1 

K 

^^-^.HCoVoPl^x)!! > ax^JlCoVoPiJx) 

i=2 
K 

^a.P^HCoVoPL^x)!! = ^JICoVoPl^x) 

3=2 
where $\beta_0 = \alpha_1 - \sum_{i=2}^{K} \alpha_i$.

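Now, let $\gamma_2 := \beta_0 E_{\mu_1}\|C_0 V_0 P_{L_1}(x)\|/4$, so that the random variable $J(x)$ has
expectation larger than $4\gamma_2$ while taking values in $[-1, 1]$; thus by Hoeffding's inequality:

$$\frac{\sum_{x \in \mathcal{X}_1} \|C_0 V_0 P_{L_1}(x)\| - \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} \|C_0 V_0 P_{L_1}(x)\|}{N} > 2\gamma_2 \quad \text{w.p. } \ge 1 - \exp(-2N\gamma_2^2). \qquad (57)$$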
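
The probability in (57) is an instance of the one-sided Hoeffding bound for i.i.d. variables with values in $[-1, 1]$. As a minimal sketch (the function names are illustrative and not taken from the paper), and assuming the standard form $P(\bar{J} \le E J - t) \le \exp(-2Nt^2/(b-a)^2)$ for variables in $[a, b]$, the bound can be evaluated as follows:

```python
import math

def hoeffding_lower_tail(n, t, lo, hi):
    """Hoeffding bound on P(sample mean <= true mean - t) for n i.i.d. variables in [lo, hi]."""
    return math.exp(-2.0 * n * t * t / (hi - lo) ** 2)

def failure_bound_57(n, gamma2):
    # J(x) lies in [-1, 1] and has mean larger than 4*gamma2, so the empirical
    # mean can drop to 2*gamma2 only after a deviation of at least t = 2*gamma2.
    return hoeffding_lower_tail(n, 2.0 * gamma2, -1.0, 1.0)
```

With $t = 2\gamma_2$ and range $b - a = 2$ the exponent is $-2N(2\gamma_2)^2/4 = -2N\gamma_2^2$, which agrees with the probability stated in (57).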

We have thus proved that (54) is valid with sufficiently high probability for fixed
matrices $C_0 \in NS_+(d)$ and $V_0 \in O(d)$. Next we estimate the probability of (54) for
all matrices $C \in NS_+(d)$ and $V \in O(d)$, when restricted to a ball with sufficiently
small radius. We let

$$\mathrm{dist}((C_1, V_1), (C_2, V_2)) := \max(\|C_1 - C_2\|_2, \|V_1 - V_2\|_2) \qquad (58)$$

and note that whenever $\mathrm{dist}((C_1, V_1), (C_2, V_2)) < \gamma_2/2$ and $x \in B(0, 1)$ we have
that

$$\|C_1 V_1 P_{L_1}(x)\| - \|C_2 V_2 P_{L_1}(x)\| = \big(\|C_1 V_1 P_{L_1}(x)\| - \|C_2 V_1 P_{L_1}(x)\|\big) + \big(\|C_2 V_1 P_{L_1}(x)\| - \|C_2 V_2 P_{L_1}(x)\|\big) \le \|C_1 - C_2\|_2 + \|C_2\|_2 \|V_1 - V_2\|_2 < \gamma_2. \qquad (59)$$

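Combining (57) and (59) we obtain that for any ball in $G(D,d)$ of radius $\gamma_2/2$ and
center $(C_0, V_0)$:

$$\frac{\sum_{x \in \mathcal{X}_1} \|CVP_{L_1}(x)\| - \sum_{x \in \mathcal{X} \setminus \mathcal{X}_1} \|CVP_{L_1}(x)\|}{N} > \gamma_2 \quad \text{w.p. } \ge 1 - \exp(-2N\gamma_2^2). \qquad (60)$$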

We easily extend (60) for all pairs of matrices (C, V) in the compact space NS+ (d) x 
0(d) (with the distance specified in (58)). Indeed, it follows from [40] together with 
some basic estimates that the latter space can be covered by £>^( d+1 )/ 2 /(j 2 /2) d ^ d+1 ^ 2 
balls of radius 72/2. Therefore, 

(54) is valid for any C £ NS+(d) and V £ 0(d) (61) 
w.p. 1 - C 1 2d exp(-2iV72 2 )/(72/2) 2d - 1 . 
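
The passage from a single ball to the whole compact space is a union bound: the failure probability of one ball is multiplied by the number of balls in the cover. The following sketch illustrates why this still leaves an overwhelming probability. It treats the covering number as a plain input `num_balls`, a hypothetical parameter standing in for the count quoted above, and uses the per-ball failure bound $\exp(-2N\gamma_2^2)$ of (60).

```python
import math

def union_bound_failure(num_balls, n, gamma2):
    """Union bound: num_balls events, each failing w.p. at most exp(-2 n gamma2^2)."""
    return min(1.0, num_balls * math.exp(-2.0 * n * gamma2 ** 2))

def samples_needed(num_balls, gamma2, delta):
    """Smallest n for which the union bound above is at most delta."""
    return math.ceil(math.log(num_balls / delta) / (2.0 * gamma2 ** 2))
```

Since the covering number does not depend on $N$, it enters the required sample size only logarithmically, which is the sense in which the bound has the form $1 - C\exp(-N/C)$.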

Equation (53) follows w.o.p. from (54) in exactly the same way of deriving (44) 
from (45) and (46). We remark that (46), which is deterministic, easily extends to the 
current case. While we did not estimate the overwhelming probability for (44), it is 
easy to show that in the current case, (54) implies (53) w.p. 1 — cxp(— Njs)/^. Car- 
rying this analysis, one notices that both 71 and 78 depend on p, d, K, 010, a\ and 
miii2<i<A'(dist(Li, Lj)). Combining this with (61), we obtain that 

Li is a best l p subspace in B(L X , 71) w.p. 1 - C 1 2d cxp(-2^7|)/(7 2 /2) 2 ^ 1 (62) 
- exp(-7V7 4 )/7 4 . 
We have just proved that $L_1$ is a best $\ell_p$ subspace w.o.p. in $B(L_1, \gamma_1)$. We now extend
this result to subspaces in $G(D, d) \setminus B(L_1, \gamma_1)$. Applying Lemma 9.3 we obtain that

$$E_{\mu_i}\big(\mathrm{dist}(x, L)^p - \mathrm{dist}(x, L_1)^p\big) + E_{\mu_1}\big(\mathrm{dist}(x, L)^p - \mathrm{dist}(x, L_1)^p\big) \ge 0 \quad \forall\, 2 \le i \le K. \qquad (63)$$

Further application of Lemma 9.1 withL E G(D, d) \B(Li, 71), results in the inequal- 
ity: 

i^(dist(x,Lf)> — % (64) 
Now, combining (63) and (64) we have that 
^(dist(x,L) p -dist(x,Li) p ) 

K 

= ( di8t ( X > L ) P - dist ( X - L l) P ) + E K ( di8t ( X ' L ) P - dist ( X > L l) P )) 

i=2 

+ (3 E^ (dist(x, L)P - dist(x, Uf) 
>0 + foE^ (dist(x, L)P) > 79 = ~^P±w, 

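where $\gamma_9$ depends on $d$, $K$, $p$, $\alpha_0$, $\alpha_1$ and $\min_{2 \le i \le K}(\mathrm{dist}(L_1, L_i))$. Noting further that
$\mathrm{dist}(x, L) - \mathrm{dist}(x, L_1)$ takes values in $[-1, 1]$ and applying Hoeffding's inequality we
obtain that for any $L \in G(D, d) \setminus B(L_1, \gamma_1)$:

$$e_{\ell_p}(\mathcal{X}, L) - e_{\ell_p}(\mathcal{X}, L_1) > \gamma_9 N/2 \quad \text{w.p. } \ge 1 - \exp(-N\gamma_9^2/8). \qquad (65)$$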

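By Lemma 9.2 we obtain that for any $L' \in G(D, d)$ satisfying $\mathrm{dist}(L, L') \le (\gamma_9/4)^{1/p}$
and any $x \in B(0, 1)$:

$$|\mathrm{dist}(x, L')^p - \mathrm{dist}(x, L)^p| \le \gamma_9/4.$$

Consequently, for any $L \in G(D, d) \setminus B(L_1, \gamma_1)$ and all $L' \in B(L, (\gamma_9/4)^{1/p})$:

$$e_{\ell_p}(\mathcal{X}, L') - e_{\ell_p}(\mathcal{X}, L_1) > 0 \quad \text{w.p. } \ge 1 - \exp(-N\gamma_9^2/8). \qquad (66)$$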



Following [Remark8.4][39]wecancoverG(Li,d)\B(Li,7i)byC2 (D d) /% 



d(n-d) 



'9 

balls of radius e. Now, for each such ball we have that (65) is valid for its center 
w.p. 1 — cxp(— Njg j 8) and consequently (66) is valid for subspaces in that ball with the 
same probability. We thus conclude that (66) is valid for all L' G G(D, d) \ B(Li , 71) 
w.p. 1 - exp(-iV7|/8)C2 (D " d)/p /79 (£, " d)/p . Combining this with (62), we obtain 
that the probability that Li is a best l p subspace in G(D, d) is 

l-C^cxpMA^)/^^) 2 ^ 1 ^ 

or equivalently, $1 - C\exp(-N/C)$ for some $C$ depending on $D$, $d$, $K$, $p$, $\alpha_0$, $\alpha_1$ and
$\min_{2 \le i \le K}(\mathrm{dist}(L_1, L_i))$.

□ 
9.4.2. Proof of Theorem 1.2

We will prove the theorem for $p = 1$ only, since its extension to $0 < p < 1$ follows
from the proof of Theorem 4.2. Other parts of the proof will also be shortened due to
their similarity to the proofs of Theorems 1.1 and 4.2.

Throughout the proof we view the energy $e_{\ell_1}(\mathcal{X}, L_1, L_2, \ldots, L_K)$ as a function
defined on $G(D,d)^K$ while being conditioned on the fixed data set $\mathcal{X}$. On the other
hand we view $e_{\ell_1}(x, L_1, L_2, \ldots, L_K)$ as a function on $\mathbb{R}^D \times G(D, d)^K$. We distinguish
elements in $G(D, d)^K$ by the norm on the product space, that is

$$\mathrm{dist}\big((L_1, L_2, \ldots, L_K), (\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)\big) = \max_{1 \le i \le K} \big(\mathrm{dist}(L_i, \hat{L}_i)\big). \qquad (67)$$

We note that it is enough to prove that the set $\{L_1, L_2, \ldots, L_K\}$ minimizes w.o.p. the
energy $e_{\ell_1}(\mathcal{X}_0 \cup \mathcal{X}_1, L_1, L_2, \ldots, L_K)$, where $\{\mathcal{X}_i\}_{i=0}^{K}$ have been defined in Section 2.1.
Indeed, it follows from the immediate observation:

$$e_{\ell_1}\Big(\bigcup_{i=2}^{K} \mathcal{X}_i, L_1, L_2, \ldots, L_K\Big) = 0.$$



We will first show that there exists a constant $\gamma_1 > 0$ such that the set $\{L_1, L_2, \ldots, L_K\}$
is a minimizer w.o.p. of $e_{\ell_1}(\mathcal{X}_0 \cup \mathcal{X}_1, L_1, L_2, \ldots, L_K)$ in $B((L_1, L_2, \ldots, L_K), \gamma_1)$.
In order to simplify notation in this part of the proof, we will adopt WLOG
the convention that the RHS of (67) occurs at $i = 1$, i.e.,

$$\mathrm{dist}(L_1, \hat{L}_1) = \max_{1 \le i \le K} \big(\mathrm{dist}(L_i, \hat{L}_i)\big). \qquad (68)$$

For all $1 \le i \le K$, we parameterize the geodesic lines from $L_i$ to $\hat{L}_i$, where

$$\mathrm{dist}\big((L_1, L_2, \ldots, L_K), (\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)\big) = 1,$$

by functions $L_i(t)$ on the interval $[0, \mathrm{dist}(L_i, \hat{L}_i)]$ such that $L_i(0) = L_i$ and
$L_i(\mathrm{dist}(L_i, \hat{L}_i)) = \hat{L}_i$. Applying Lemma 9.2 and assuming $j = \arg\min_{1 \le i \le K} \mathrm{dist}(x, L_i)$, we
derive the following estimate:



■^(e, 1 (x,L 1 (t),L 2 (t),.-- ,L K (t))) 



= lim 



dist(x,Lj(t)) -dist(x,L,-(0)) 



> _|| X || ^ dist(L j ft),L 3 (0)) = _,| x| | dist(Lj(1);Lj(0)) > _ w . (69) 

Combining (69) with Hoeffding's inequality, we obtain that 

d 



(e^cM^L^),--- ,L*(t))) 
at 



> - V] ||x|| > -aoA^ w.o.p. (70) 

4=0 x G * 



Now, following the arguments of the proof of (27), we conclude the equality: 
d 



MXuUWMV),-- Mt))) 

at 



= ]T ||CVP Ll (x)|| w.o.p., (71) 
where $C \in NS_+(d)$ and $V \in O(d)$ as in the latter proof. Thus, by Hoeffding's
inequality, there exists $\lambda_1 = \lambda_1(d) > 0$ such that



■^ t (eMM(t)M(t),--- ,M*))) 



> aiXiN w.o.p. (72) 

t=o 



Using this constant Ai, we set 



v a := min ( Ai, — — r ) . (73) 
V AKd? J 

It follows from (4) and (73) that $\alpha_1 \lambda_1 - \alpha_0 > 0$. Now, combining (70) and (72) we
obtain that there exists a constant $0 < \gamma_2 < \alpha_1 \lambda_1 - \alpha_0$ such that



(e h (X U Afi, Li(t), L 2 (t), • • ■ , Ljr(i))) 



> 72^V w.o.p. 

t=o 



We use the arguments of the proof of (44) to conclude that there exists a constant
$\gamma_1 > 0$ such that

$$\frac{d}{dt}\Big(e_{\ell_1}\big(\mathcal{X}_0 \cup \mathcal{X}_1, L_1(t), L_2(t), \ldots, L_K(t)\big)\Big) > 0 \quad \text{for all } 0 < t < \gamma_1 \text{ w.o.p.} \qquad (74)$$

Consequently, $\{L_1, L_2, \ldots, L_K\}$ is a minimizer w.o.p. of $e_{\ell_1}$ in the ball $B((L_1, L_2, \ldots, L_K), \gamma_1)$.
Since $e_{\ell_1}$ is symmetric on $G(D, d)^K$, it is also the minimizer w.o.p. of
$e_{\ell_1}$ in $\bigcup_{(i_1, i_2, \ldots, i_K) \in \mathcal{P}_K} B((L_{i_1}, L_{i_2}, \ldots, L_{i_K}), \gamma_1)$, where $\mathcal{P}_K$ is the set of all permutations
of $(1, 2, \ldots, K)$.

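Next, we note that the set $\{L_1, L_2, \ldots, L_K\}$ is also a global minimizer outside this
ball, that is, for any

$$(\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K) \in GP(D, d, \gamma_1) := G(D,d)^K \setminus \bigcup_{(i_1, i_2, \ldots, i_K) \in \mathcal{P}_K} B\big((L_{i_1}, L_{i_2}, \ldots, L_{i_K}), \gamma_1\big): \qquad (75)$$

$$e_{\ell_1}(\mathcal{X}, \hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K) - e_{\ell_1}(\mathcal{X}, L_1, L_2, \ldots, L_K) > C_2 N \quad \text{w.o.p.} \qquad (76)$$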

Indeed, (76) follows by choosing 71 < do (where do was defined in (20)) and combin- 
ing (73), Hoeffding's inequality, Lemma 9.4 and the assumption specified in (4). 

In order to conclude the theorem we extend (76) w.o.p. for all $K$ subspaces in the
set $GP(D, d, \gamma_1)$ defined in (75) (and not for a fixed subspace in that set). This is
done as in the proof of Theorem 4.2 by covering $GP(D, d, \gamma_1)$ with balls and similarly
concluding that $L_1, L_2, \ldots, L_K$ and any of its permutations minimizes
$e_{\ell_1}(\mathcal{X}_0 \cup \mathcal{X}_1, L_1, L_2, \ldots, L_K)$ and consequently $e_{\ell_1}(\mathcal{X}, L_1, L_2, \ldots, L_K)$ w.o.p.

□ 



9.5. Theory of Section 5 

9.5.1. Reduction of Theorem 5.1 and Theorem 5.2 

We first explain how to reduce the proof of Theorem 5.1 when $0 < p < 1$ to the verification
of a simpler statement. We then adapt this idea for proving the same theorem
when both $p > 1$ and $K = 1$, as well as for proving Theorem 5.2.

In order to prove Theorem 5.1 when $0 < p < 1$, i.e., prove that the global minimum
of $e_{\ell_p}(\mathcal{X}, L)$ is in $B(L_1, f)$ w.o.p., we only need to show that there exists a constant
$\gamma_1 > 0$ such that for any $L \notin B(L_1, f)$:

$$E_{\mu_\epsilon}(e_{\ell_p}(x, L)) > E_{\mu_\epsilon}(e_{\ell_p}(x, L_1)) + \gamma_1. \qquad (77)$$

Indeed, we cover the compact space $G(D, d) \setminus B(L_1, f)$ by small balls with radius $\gamma_1/2$.
Then by using (77) and Hoeffding's inequality, we obtain that $e_{\ell_p}(\mathcal{X}, L) > e_{\ell_p}(\mathcal{X}, L_1)$
for any $L$ in each such ball w.o.p. Therefore, $e_{\ell_p}(\mathcal{X}, L) > e_{\ell_p}(\mathcal{X}, L_1)$ for $L \in G(D, d) \setminus B(L_1, f)$
w.o.p. Equivalently, $G(D, d) \setminus B(L_1, f)$ does not contain the global minimum
of $e_{\ell_p}(\mathcal{X}, L)$ w.o.p.
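
For concreteness, the energy compared in this argument can be evaluated directly. The sketch below assumes that $e_{\ell_p}(\mathcal{X}, L) = \sum_{x \in \mathcal{X}} \mathrm{dist}(x, L)^p$, as the normalization by $N$ elsewhere in the proofs suggests, and that the subspace $L$ is supplied through an orthonormal basis. The function names are illustrative only.

```python
import math

def dist_to_subspace(x, basis):
    """Euclidean distance from x to span(basis); the basis vectors are assumed orthonormal."""
    residual = list(x)
    for q in basis:
        coeff = sum(xi * qi for xi, qi in zip(x, q))
        residual = [ri - coeff * qi for ri, qi in zip(residual, q)]
    return math.sqrt(sum(ri * ri for ri in residual))

def lp_energy(points, basis, p):
    """Sum over the data of dist(x, L)^p, where L = span(basis)."""
    return sum(dist_to_subspace(x, basis) ** p for x in points)
```

Comparing `lp_energy(points, basis_L, p)` with `lp_energy(points, basis_L1, p)` for candidate bases is the empirical counterpart of the inequality between expectations in (77).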

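For $i = 1, \ldots, K$, let $\hat{\mu}_{i,\epsilon}$ be the measure obtained by projecting $\mu_{i,\epsilon}$ onto its corresponding
subspace $L_i$ (that is, for any set $E \subseteq B(0, 1) \cap L_i$: $\hat{\mu}_{i,\epsilon}(E) = \mu_{i,\epsilon}(P_{L_i}^{-1}(E))$).

We also let $\hat{\mu}_\epsilon := \alpha_0 \mu_0 + \sum_{i=1}^{K} \alpha_i \hat{\mu}_{i,\epsilon}$. By the triangle inequality:

$$|E_{\mu_\epsilon}(e_{\ell_p}(x, L)) - E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L))| \le \epsilon^p.$$

Hence, in order to prove (77) and thus Theorem 5.1 for $p < 1$, the following equation
is sufficient:

$$E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L)) > E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L_1)) + \gamma_1 + 2\epsilon^p, \quad \text{for any } L \in G(D, d) \setminus B(L_1, f). \qquad (78)$$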

Similarly, we reduce Theorem 5.1 when $K = 1$ and $p > 1$ to the following condition:

$$E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L)) > E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L_1)) + \gamma_1 + 2p\epsilon, \quad \text{for any } L \in G(D, d) \setminus B(L_1, f). \qquad (79)$$

Indeed, we note that for any $x_1, x_2 \in B(0, 1)$ with $\mathrm{dist}(x_1, x_2) \le \eta < 1$ and any
$\hat{L}_1, \hat{L}_2 \in G(D,d)$ with $\mathrm{dist}(\hat{L}_1, \hat{L}_2) \le \eta$:

$$\mathrm{dist}(x_1, \hat{L}_1)^p - \mathrm{dist}(x_2, \hat{L}_1)^p \le 1 - (1 - \eta)^p \le p\eta, \qquad (80)$$

and

$$\mathrm{dist}(x_1, \hat{L}_1)^p - \mathrm{dist}(x_1, \hat{L}_2)^p \le 1 - (1 - \eta)^p \le p\eta. \qquad (81)$$

When $p = 1$, (80) follows from the triangle inequality and (81) follows from Lemma 9.2,
whereas both equations extend to $p > 1$ by the following property of the $p$-th power: if
$0 \le y_1, y_2 \le 1$, $y_1 - y_2 \le \eta$ and $p \ge 1$, then $y_1^p - y_2^p \le 1 - (1 - \eta)^p$.
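
The property of the $p$-th power used here, together with the Bernoulli-type bound $1 - (1-\eta)^p \le p\eta$, can be checked numerically. The following is only an illustrative sanity check on a grid (the helper name is hypothetical), not a replacement for the argument:

```python
def power_gap_bounds(y1, y2, p):
    """Return (gap, middle, upper) with eta = |y1 - y2|:
    gap = |y1^p - y2^p|, middle = 1 - (1 - eta)^p, upper = p * eta."""
    eta = abs(y1 - y2)
    gap = abs(y1 ** p - y2 ** p)
    middle = 1.0 - (1.0 - eta) ** p
    upper = p * eta
    return gap, middle, upper
```

For $y_1, y_2 \in [0, 1]$ and $p \ge 1$ the three returned values should be nondecreasing, up to floating-point rounding. The first inequality holds because $y \mapsto y^p - (y - \eta)^p$ is nondecreasing on $[\eta, 1]$ and so is largest at $y = 1$.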

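Following a similar argument, we reduce the verification of Theorem 5.2 to proving
that for all $(\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)$ satisfying $\mathrm{dist}((L_{i_1}, L_{i_2}, \ldots, L_{i_K}), (\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)) > f$
for all permutations $(i_1, i_2, \ldots, i_K) \in \mathcal{P}_K$:

$$E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, \hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)) > E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L_1, L_2, \ldots, L_K)) + \gamma_1 + 2\epsilon^p. \qquad (82)$$

We conclude with the proofs of (78), (79) and (82).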
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9.5.2. Proof of (78) and (79) and conclusion of Theorem 5.1 

We arbitrarily fix L 6 G(D, d) \ B(Li, /). We assume first that < p < 1 and apply 
Lemma 9.3 to obtain that 

^A s -(«i-Ef =2 Ql )Ai, s %( x > L ) - £ fe-(«i-E£ 2 ^)Ai, e %( x > Ll ) 
A' 

i=2 

Consequently, we prove (78) with 71 := 2e p as follows: 

E Al (e,,(x,L))-.E Al (e,,(x,L 1 )) > L-^ ai j £ Ali6 (e, p (x,L)) (83) 

(«i-Ef= 2 ^) / p 

> - - = 4e p 

where the second inequality applies Lemma 9.1.

Equation (79) follows from the same argument of (83), where $\epsilon^p$ is now replaced by $p\epsilon$.

Proof of (82) and conclusion of Theorem 5.2 

In view of Lemma 9.4 it is sufficient to prove that 

t min a, - a ) f p > 71 + 2e p (84) 

l<j<K J J 

and 

t min otj min dist p (L;, L,-)/2 p - a > 7i + 2e p . (85) 
i^i^^ i=*j=^" 

Setting $\gamma_1 = \epsilon^p$, (84) follows from (17) and (85) follows from (16).

□ 

9.6. Theory of Section 6 

9.6.1. Reduction of Theorem 6.1

We explain how to reduce Theorem 6.1. We use the same notation of Section 9.5.1, in
particular, $\hat{\mu}_\epsilon$.

Theorem 6.1 states that the best $\ell_p$ subspace is not in $B(L_1, \kappa_0)$ w.o.p. for almost
every $\{L_i\}_{i=1}^{K} \in G(D, d)^K$. We claim that it reduces to the following simple equation:

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D, d) : L_1 = \arg\min_{L} E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L))\Big) = 0. \qquad (86)$$
Indeed, if (86) is not satisfied, then for any $K$ $d$-subspaces $\{L_i\}_{i=1}^{K}$ in a subset of
$G(D, d)^K$ with nonzero $\gamma_{D,d}^{K}$ measure there exists $L_0 \in G(D, d)$ such that

$$\gamma_1 := E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L_1)) - E_{\hat{\mu}_\epsilon}(e_{\ell_p}(x, L_0)) > 0.$$

Letting S = k = ji/4pe, we obtain from (80) and (81) that for any L* £ B(Li, «o): 

(e lp (x, L*)) - (e lp (x, L )) > E^ (e lp (x, L*)) — E K (e lp (x, L )) - 2S Q p 

7i 

> ^«( e i P ( x ' L i)) - E fi c ( e i P ( x ^ L o)) - 2S p- K p = —. 
Therefore, by Hoeffding's inequality: 

e lp (X,L*)~e lp (X,L ) > \~ w.o.p. 

In order to have

$$e_{\ell_p}(\mathcal{X}, L^*) - e_{\ell_p}(\mathcal{X}, L_0) > 0 \quad \text{for all } L^* \in B(L_1, \kappa_0) \text{ w.o.p.},$$

we cover $B(L_1, \kappa_0)$ by small balls with radius $\gamma_1/16$, so that $e_{\ell_p}(\mathcal{X}, L) > e_{\ell_p}(\mathcal{X}, L_0)$
for all $L$ in each such ball w.o.p. Therefore, $e_{\ell_p}(\mathcal{X}, L) > e_{\ell_p}(\mathcal{X}, L_0)$ for all $L \in B(L_1, \kappa_0)$
w.o.p. Equivalently, $B(L_1, \kappa_0)$ will not contain the global minimum of $e_{\ell_p}(\mathcal{X}, L)$
w.o.p. This contradicts Theorem 6.1 and therefore (86) implies this theorem.

9.6.2. Proof of (86) and conclusion of Theorem 6.1 
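In view of Proposition 3.2, we only need to prove that

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D, d) : E_{\hat{\mu}_\epsilon}(D_{L_1,x,p}) = 0\Big) = 0, \qquad (87)$$

where $D_{L,x,p}$ is the operator defined in (6). Using the notation

$$h(L_1, L_i) = E_{\hat{\mu}_{i,\epsilon}}(D_{L_1,x,p}), \quad 2 \le i \le K,$$

we rewrite (87) as follows:

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D,d) : E_{\hat{\mu}_\epsilon}(D_{L_1,x,p}) = 0\Big) = \gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D,d) : E_{\sum_{i=2}^{K} \alpha_i \hat{\mu}_{i,\epsilon}}(D_{L_1,x,p}) = 0\Big) = \gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D, d) : \sum_{i=2}^{K} \alpha_i h(L_1, L_i) = 0\Big) = 0. \qquad (88)$$

Since $\{L_i\}_{i=1}^{K}$ are independently distributed according to $\gamma_{D,d}$, Fubini's theorem implies
that (88) follows from the equation:

$$\gamma_{D,d}\big(L_2 \in G(D, d) : h(L_1, L_2) = C(L_1, L_3, \ldots, L_K)\big) = 0, \qquad (89)$$

where $C(L_1, L_3, \ldots, L_K) = -\sum_{i=3}^{K} \alpha_i h(L_1, L_i)/\alpha_2$.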
We follow by proving (89) and consequently concluding (86). We denote the principal
angles between $L_2$ and $L_1$ by $\{\theta_j\}_{j=1}^{d}$, the principal vectors of $L_2$ and $L_1$ by
$\{\hat{v}_j\}_{j=1}^{d}$ and $\{v_j\}_{j=1}^{d}$ respectively and the complementary orthogonal system for $L_2$
w.r.t. $L_1$ by $\{u_j\}_{j=1}^{d}$. Note that as an operator, $h(L_1, L_2)$ maps $\mathrm{Sp}(\{u_j\}_{j=1}^{d})$ to $\mathrm{Sp}(\{v_j\}_{j=1}^{d})$.
Now, transforming $x \in L_2 \cap B(0, 1)$ to $\{a_i\}_{i=1}^{d}$ in a $d$-dimensional unit ball by
$x = \sum_{i=1}^{d} a_i \hat{v}_i$, we have that for any $1 \le i_1, i_2 \le d$:

v^(L 1; L 2 )u 42 = £ M2 (v l T i P Ll (x)P L J - l (x) T u i2 dist(x,L 1 f- 2 ) 

= / cos^a^ sin0j 2 aj 2 } a 2 sin 2 0j dV, 

where dF denotes the scaled volume element on the d-dimensional ball $^ i=1 aj 2 < 1. 
When «i ^ i-2, the function 



3 f i ± ttjj sin Wi 2 ai 2 I a 2 sin 2 ' 



= 0. 



/ cos 9j sin 6j a 2 > a 2 sin 2 #i , j = 1, • ■ ■ , d. 



is odd w.r.t. and consequently 

vf h(Li, L 2 )uj, = / cos 6L a,, sin aj 2 V a 2 sin 2 #i 

Therefore, when we form V and U as in (25), the d x d matrix 

VE, 2 (P Ll (x)P^ (x) T dist(x, Lx)f" 2 )U T 
is diagonal with the elements 

p-2 

(d \ ~ 

i=l / 

We denote 

>-= 

/ d 

A,(/i(Li,L 2 )) = / cos sin #0 a 2 ya 2 sin 2 

where we note that {Ai(/i(Li, L2))}f =1 are the singular values of ft,(Li,L2). We ar- 
bitrarily fix Li, L3, L4, Lk and denote the singular values of C = C(Li,L3, 
L4, • • • , Lk) by {<Ji}fLi an d n °te that (89) is implied by the following equation: 

7fl,d(L 2 G G{D,d) : Ai(MLi,L a )) G {a i }g 1 )=0, (90) 

which we express as: 

d P7 2 \ 

I J cos 0i sin #ia 2 ^a 2 sin 2 ^j dyeW^J (91) 



, j = 1, •■■ ,d, 
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= 0. 



We first conclude (91) when p = 2. In this case 

/ cos#isin0ia^ y^a^sin 2 ^ = / cos d\ sin 6iaf 

J-ZLi*! 2 ^ Vfci / ^E- =1 ai 2 <i 

(92) 

is a monotone function of $\theta_1$ on $[0, \pi/4]$ as well as $[\pi/4, \pi/2]$. That is, the requirement
that $\lambda_1(h(L_1, L_2)) \in \{\sigma_i\}_{i=1}^{d}$ can occur only at discrete values of $\theta_1$ and consequently
has $\gamma_{D,d}$ measure 0, that is, (91) (and consequently (86)) is verified in this case.
If $p \neq 2$ and $\{\theta_i\}_{i=2}^{d}$ are fixed, then



p-2 
2 

(93) 



/ cos Q\ sin 81 a\ > of sin 2 6i 

is a monotone function of 6^. Following a similar argument we conclude that 

7D,d (h(L u L 2 ) e {^}f =1 |{^}ti) = 0. (94) 

Combining (94) and Fubini's theorem, we conclude (91) (and consequently (86)) in 
this case. 



9.6.3. First Reduction of Theorem 6.2 

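Following the argument of Section 9.6.1, we note that Theorem 6.2 will follow by
proving the following equation:

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D,d) : (L_1, L_2, \ldots, L_K) = \arg\min_{(\hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)} E_{\hat{\mu}_\epsilon}\big(e_{\ell_p}(x, \hat{L}_1, \hat{L}_2, \ldots, \hat{L}_K)\big)\Big) = 0. \qquad (95)$$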

We further reduce (95) by using the operator $D_{L,x,p}$ of (6) and the regions $\{Y_i\}_{i=1}^{K}$,
which are obtained by a Voronoi diagram (restricted to the unit ball) of the $d$-subspaces
$\{L_i\}_{i=1}^{K} \subset G(D, d)$ as follows:

$$Y_i = Y_i(L_1, L_2, \ldots, L_K) = \{x \in B(0, 1) : \mathrm{dist}(x, L_i) < \mathrm{dist}(x, L_j) \ \forall j : 1 \le j \neq i \le K\}. \qquad (96)$$
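
As an illustration of the regions in (96), the following sketch assigns a point to the region of its strictly nearest subspace. It assumes each subspace is given by an orthonormal basis and treats the unit ball as closed; both choices, the tolerance `tol`, and the function names are assumptions made only for this example.

```python
import math

def dist_to_subspace(x, basis):
    """Distance from x to span(basis); the basis vectors are assumed orthonormal."""
    residual = list(x)
    for q in basis:
        coeff = sum(xi * qi for xi, qi in zip(x, q))
        residual = [ri - coeff * qi for ri, qi in zip(residual, q)]
    return math.sqrt(sum(ri * ri for ri in residual))

def voronoi_index(x, bases, tol=1e-12):
    """Index i (0-based) with x in Y_i, or None if x lies outside the unit ball
    or is equidistant (within tol) from its two nearest subspaces."""
    if math.sqrt(sum(xi * xi for xi in x)) > 1.0:
        return None
    dists = [dist_to_subspace(x, b) for b in bases]
    order = sorted(range(len(bases)), key=lambda i: dists[i])
    if len(order) > 1 and dists[order[1]] - dists[order[0]] <= tol:
        return None  # the strict inequality in (96) fails on region boundaries
    return order[0]
```

Points for which `voronoi_index` returns `None` inside the ball lie on the boundaries between regions, that is, in the closures of at least two regions but in none of the open regions themselves.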

We will show that the following equation implies (95) and thus Theorem 6.2:

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} \subset G(D,d) : E_{\mu_0}\big(I(x \in Y_j(L_1, L_2, \ldots, L_K))\, D_{L_j,x,p}\big) = 0 \ \forall\, 1 \le j \le K\Big) = 0. \qquad (97)$$

We first show that if the condition 

(Li,L 2 ,--- ,L K ) = argmin ( L i fj2 ... fj/f) A Ae (e/ p (x,Li,L 2 , • • • ,L K )) 
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of (95) is true, then 

Li = argmin LeG(Ad) A As (e ip (x,L)/(x G Yi)) . (98) 
Indeed, for any Li G G(D, d), we have that 

< £? 0e (ei p (x,Li,L2, ■ ■ • ,L K )) - (e ip (x, Li, L 2 , ■ • • ,L K )) 
<%(/(x6Y 1 )e lp (x,L 1 ))+ ]T B A[ (7(xeY i )e, p (x ) L j )) 

2<(<if 

- ]T A Ac (/(xG Y ? ;) % (X,L,)) 
l<i<-ft" 

= £fc.(/(x g Yi)e, p (x,Li)) - £fc.(J(x G Yi)e, p (x,Li)). 
Therefore, (95) immediately follows from the equation: 

Ida ({Li}£iCG(A<0: 

Iy = argmin LeG(D>d) £; Ae (e ip (x, L)/(x G Y,)) V 1 < j < A) = 0. (99) 

Proposition 3.2 can be extended to probability measures instead of data sets, where 
its sum is replaced by an expectation (indeed, this is inferred by the proof of Proposi- 
tion 3.2). In view of this modified version, (99) implies the following equation: 

Ida : E^(I(x e Y,iD,. . x .,; = V 1 < j < K) = 0. (100) 

At last we easily note that (100) is equivalent with (97) and thus a proof of (97) will 
conclude Theorem 6.2. 

9.6.4. Second Reduction of Theorem 6.2 

In order to motivate a further reduction of (97), we formulate a proposition demonstrating
under some conditions the sensitivity of the region $Y_1$ (or WLOG any of the
regions $\{Y_i\}_{i=1}^{K}$) to perturbations in the subspace $L_2$ (or WLOG any of the other subspaces
but $L_1$, when fixing $Y_1$). This proposition uses the following notation, which
will also be used throughout the rest of Section 9.6: $d^* = \min(d, D - d)$, $\theta_{d^*}(L_i, L_j)$
is the $d^*$-th largest principal angle between the $d$-subspaces $L_i$ and $L_j$, $\mathcal{L}_k$ is the
$k$-dimensional Lebesgue measure, $a \vee b$ denotes the maximum of $a$ and $b$ and
for $\hat{L}_2, L_1, L_2, \ldots, L_K \in G(D,d)$ and $1 \le i \le K$, we use the notation: $Y_i = Y_i(L_1, L_2, \ldots, L_K)$,
$\hat{Y}_i = Y_i(L_1, \hat{L}_2, L_3, \ldots, L_K)$, $\bar{Y}_i = \bar{Y}_i(L_1, L_2, \ldots, L_K)$ and
$\bar{\hat{Y}}_i = \bar{Y}_i(L_1, \hat{L}_2, L_3, \ldots, L_K)$, where the bar is used to denote closure, e.g.,

$$\bar{Y}_i = \{x \in B(0, 1) : \mathrm{dist}(x, L_i) \le \mathrm{dist}(x, L_j) \ \forall j : 1 \le j \neq i \le K\}, \qquad (101)$$

that is, Yj is the closure of Yj. Furthermore for L 2 , Li, L 2 , ■■ ■ , Lk G G(D,d) and 1 < 
i < A', we use the notation: Y; = Yj(Li,L 2 , • • ■ , Lk), Yj = Yj(Li,L 2 ,L3, • • • , Lk), 
Yj = Yj(Li,L 2 , • • • ,L K ) and Yj = Yj(Li,L 2 ,L 3 , • • • ,L K ). 

Given this notation we formulate the proposition mentioned above as follows. 
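Lemma 9.5. If $\hat{L}_2, L_1, L_2, \ldots, L_K$ are subspaces in $G(D, d)$ such that $\hat{L}_2 \neq L_2$,

$$\min_{j}\big(\theta_{d^*}(\hat{L}_2, L_j)\big) > 0, \quad \min_{i \neq j}\big(\theta_{d^*}(L_i, L_j)\big) > 0, \qquad (102)$$

$$\text{and} \quad \theta_{d^*}(\hat{L}_2, L_1) \vee \theta_{d^*}(L_2, L_1) < \min_{3 \le i \le K} \theta_{d^*}(L_i, L_1), \qquad (103)$$

then

$$\mathcal{L}_D\big((Y_1 \setminus \hat{Y}_1) \cup (\hat{Y}_1 \setminus Y_1)\big) > 0. \qquad (104)$$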

Using the conditions of Lemma 9.5 regarding the subspaces $\{L_i\}_{i=1}^{K}$, we can reduce
(97) to the following equation:

$$\gamma_{D,d}\Big(L_2 \in G(D,d) : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = 2, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0\Big) = 0. \qquad (105)$$

Indeed, we first note that WLOG (105) can be equivalently formulated by replacing
$L_2$ with any $L_k$, $k = 3, \ldots, K$, and letting $\arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = k$ respectively.
Using this observation and elementary properties of measures we have that

$$\gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} : E_{\mu_0}\big(I(x \in Y_j(L_1, L_2, \ldots, L_K))\, D_{L_j,x,p}\big) = 0 \ \forall\, 1 \le j \le K\Big)$$
$$\le \sum_{k=2}^{K} \int_{G(D,d)^{K-1}} \gamma_{D,d}\Big(L_k : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = k, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0 \ \Big|\ \{L_i\}_{1 \le i \neq k \le K}\Big)\, d\gamma_{D,d}^{K-1}\big(\{L_i\}_{1 \le i \neq k \le K}\big)$$
$$+\ \gamma_{D,d}^{K}\Big(\{L_i\}_{i=1}^{K} : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) = 0\Big) = 0.$$

It is thus sufficient to prove (105) in order to conclude Theorem 6.2. We end this section
by proving Lemma 9.5.

Proof of Lemma 9.5. We will first show that there exists xo G 5(0, 1) such that 

dist(x , Li) = dist(x , L 2 ) < min distfxo, LA (106) 

3<i<K 

We will then prove that (106) implies (104). 

We verify (106) in the two cases: d* = d and d* = D — d and consequently 
conclude the lemma. For both cases, we denote the principal vectors of L2 and Li 
by {vi}f =1 and {vi}f =1 respectively. We assume first that d* = d and denote xo = 
Vd* + v d . / 1 j v d , + v d . I j . Equation (102) implies that the intersections of the c?-subspaces 
{Li}fL 1 are empty. Applying this observation as well as elementary geometric esti- 
mates we obtain that for any io > 3 and any unit vector vo € L, Q : 

ang(x ,v ) > ang(v d ., v ) - ang(v d »,x ) > 6 d * (L io , L x ) - 6 d * (L 2 , L{]/2 

> 9 d , (La, L x ) - 6 d , (L 2 , L-0/2 = d . (L 2 , LO/2. (107) 
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We claim that (107) cannot be an equality. Indeed, if the first inequality in (107) 
holds, then Vq, v^* and Xo are on a geodesic line within the sphere S ^ 1 . Combin- 
ing this with the assumption that (107) is an equality, we obtain that ang(xo,vo) = 
Qd* (L2, Li)/2 = ang(xo,Vd*) = ang(xo, Vd* ). We thus conclude that either vq = v c ;» 
or vo = Vd* , which contradict ( 102) as well as the definition of principal vectors. Using 
the concluded strict inequality in (107), we prove (106) in this case as follows: 

dist(x ,L lo ) > ang(x ,v ) > d .(L 2 ,Li)/2 = dist(xo,Li) = dist(x ,L 2 ). 

Ifd* = D— d, thenxi = [y d* +v d») / \{v d* + v^* ||. It follows from basic dimension 
equalities of subspaces and (102) that for all 2 < i < K: dim(Li U Li) = D and 
dim(Li n Li) = 2d — D. We denote by Kq the integer in {0, . . . , K} such that for 
any 3 < i < Kq, Li n L; = Li n L2 and for any i > Kq, Li n L, 7^ Li n L2 
(the existence of Kq may require reordering of the indices of the subspaces {Li}fL 3 ). 
Let X2 be an arbitrarily fixed unit vector in Li n (L 2 \ ^K <i<K^i)- Denote eo = 
dist(x 2 , ^K <i<K^i) and let Xo = x 2 /2 + eo xi/5. We first claim that 

dist(xo, Li) = dist(xo, L2) < min dist(xo, Lf). (108) 

3<j<K 

Indeed, we can remove Li n L2 from the subspaces {Lj}^ and obtain subspaces of 
dimension D — d intersecting each other at the origin. We note that (108) is equivalent 
to a similar equations for the new subspaces, which replaces xo with xj.. The latter 
follows from the proof of the case d = d* . 
On the other hand, we note that 

dist(x ,Li) = e dist(xi,Li)/5 < e /5 < dist(x 2 /2, U Ko<j < K Lj) - e /5 

< dist(x 2 /2 + e Xi/5,Uit <j< K -L_,-) = min dist(x ,Lj). (109) 

Ko<i<K 

Equation (106) thus follows by combining (108) and (109). 
We note that (106) implies that 

x € Yi U Y 2 U (Yi n Y 2 ) and x € Yi U Y 2 U {% n %). (110) 

The continuity of the distance function implies a stronger version of (110): There exists 
e > such that 

£(x 0j e) C Yi U Y 2 U (% n Y 2 ) and B(x , e) C % U Y 2 U (t x fl %), (HI) 

We will deduce (104) from (1 1 1) by considering two different cases. Assume first that 
YinY 2 n J B(x , e) ^ Y_inY 2 _nB(x , e). Using (111) and the fact that £ D (YinY 2 ) = 

0, we may choose y € Yi n Y2 n B(xo, e) and also y 6 Yi U Y2; WLOG we assume 
instead of the later condition that y E Yi. By perturbing y slightly we can choose 
another point yo such that yo G Y2 and yo G Yi \ Yx. It follows from the continuity 
of the distance function that there exists a small 77 > such that (Yi \ Yi) U (Yx \ Yi) D 
Yx \ Yx D B(y , 77), which proves (104). 
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Next, assume the complimentary case where Yi n Y 2 n B(xq, e) = Yi n Y 2 H 
B(xo, e). We will exclude this case by showing that it leads to either one of the fol- 
lowing contradictions: Li = L 2 and L 2 = L 2 . We note that the solutions sets of 
x o(-Pli - -Pl 2 )x = and x^(P Ll - P £2 )x = in P(x ,e) are B(x ,e) are 

Yi n Y 2 n P(x , e) and Yi n Y 2 n P(x , e) respectively. In view of (11 1), these two 
manifolds (i.e., the solution sets) coincide. Therefore, their (D — 1) -dimensional tan- 
gent spaces at xo, i.e., Xq(P\ j1 — Pl, )xo = Oandx^(PLi — Pl 2 ) x o = 0, also coincide. 
Consequently we have that Xq(Pt j1 — Pl 2 ) = t x o{Pli — Pf, ) f° r some t 7^ 0. Simi- 
larly, for any xi G YinY 2 nP(x , e), we have xf (P Ll -P L2 ) = t± xf (P Ll -Pf, ) for 
some ii 7^ 0. We note that ti = t by the following argument: t\ xJ(P^ 1 — P £a )xq = 

xf(P Ll -Pl 2 )x = txf(P Ll -Pl 2 )x - Therefore, for any Xi G % n Y 2 n B(x , e), 
we have 

xf(P Ll -Pl 2 ) =ixf(P Ll -P £a ). (112) 

Since the tangent space of Yi n Y 2 n P(xo, e) at xo has dimension D — 1, the 
subspace Lo = Sp{Yi n Y 2 n P(xq, e)} has dimension at least D — 1 (it is possible 
to show that its dimension is D, though we have a simpler argument avoiding this 
technical detail). In view of (112), Lo satisfies 

P Lo (P Ll - Pl 2 ) = * Pl„ (P Ll -P U ). (H3) 

Using the fact that (p^ — Pf ) an d (Pli — Pl 2 ) can t> e represented as symmetric 
D x D matrices, we have the following equivalent formulation of (113): 

(P Ll - Pl 2 )Pl = (P Ll - P £a )Pl„ ■ (1 14) 

Furthermore, using the fact that (P^ — Pf ) and (Pl x — Pl 2 ) have trace 0, we obtain 
that 

tr(P L( |(P Ll - P L2 )P Lf |) = - tr(P Lo (P Ll - Pl 2 )Pl ) = -* ■ tr(P Lo (P Ll - Pf 2 )PL ) 
=t-tr(P L x(P Ll -P t2 )P L x). (115) 

Since P h ± is at most one-dimensional (it is actually zero-dimensional, but it will take 
us more time to establish it), (115) can be rewritten as 

P L -(P Ll - Pl 2 )P L( | = t ■ (P L(f (P Ll - P £2 )Pl-)- (116) 

Combining (113), (114) and (116), we obtain that (P Ll - PfJ = t (P Ll - Pl 2 ) and 
P £2 = (l-t)P Ll +tP L2 . 

We conclude the desired contradiction in two different cases. Assume first that t < 1 
and let vo be an arbitrary unit vector in L 2 . We note that v^Pf^vo = 1 as well as 
(1 - t)v n P Ll v = 1 - tv^P^vo > 1 - t. Consequently, v^P^vo = 1, i.e., 
vo G Li and thus we obtain the following contradiction with (102): Li = L 2 . Next, 
assume that t > 1 and as before vo is an arbitrary unit vector in L 2 . In this case, 
v o p l, v o = (!-*) v o PliVo + 1 v^Pl.vo < + = 0. Therefore, v G and we 
obtain the following contradiction with (102): L 2 = L 2 . Equation (104) is thus proved. 

□ 
9.6.5. Proof of (105) when d = 1 or d = D − 1

We first prove (105) when $d = 1$. We fix $v_1$ to be one of the two unit vectors spanning
$L_1$ and denote by $u_1$ the unit vector spanning $(L_1 + L_2) \cap L_1^{\perp}$ having orientation such
that for any point $x \in L_2$: $(x^T u_1)(x^T v_1) \ge 0$. We will prove that

$$\gamma_{D,d}\Big(L_2 : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = 2, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0 \ \Big|\ (L_1 + L_2) \cap L_1^{\perp} = \mathrm{Sp}(u_1)\Big) = 0. \qquad (117)$$

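We first show that (117) implies (105). We define $\Omega_0 = \{x \in S^{D-1} : x \perp v_1\}$
(recalling that $S^{D-1}$ is the unit sphere in $\mathbb{R}^D$) and the measure $\omega$ on $\Omega_0$ such that for
any $A \subseteq \Omega_0$: $\omega(A) = \gamma_{D,d}(L_2 \in G(D,d) : (L_1 + L_2) \cap L_1^{\perp} \in \mathrm{Sp}(A))$. Using this
notation, (117) implies (105) as follows:

$$\gamma_{D,d}\Big(L_2 \in G(D, d) : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = 2, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0\Big)$$
$$= \int_{\Omega_0} \gamma_{D,d}\Big(L_2 : \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = 2, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0 \ \Big|\ (L_1 + L_2) \cap L_1^{\perp} = \mathrm{Sp}(u_1)\Big)\, d\omega(u_1) = 0.$$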



We will prove (117) by showing that at most one element satisfies its underlying
condition (i.e., is a member of the set for which $\gamma_{D,d}$ is evaluated). Assume on the
contrary that there are two subspaces $L_2$ and $\hat{L}_2$ satisfying this underlying condition
with corresponding angles $\theta$ and $\hat{\theta}$ in $[0, \pi/2]$, where WLOG $\theta > \hat{\theta}$.

We define

$$Y_1 = Y_1(L_1, L_2, L_3, \ldots, L_K) \quad \text{and} \quad \hat{Y}_1 = Y_1(L_1, \hat{L}_2, L_3, \ldots, L_K). \qquad (118)$$

Since both $L_2$ and $\hat{L}_2$ satisfy the underlying condition of (117), we have that

(/(x G Yi \ Yi) D Ll , XiP ) - (/(x G Yi \ Y x ) D LliX , p ) 
=2 • (£„ (/(x G Yi)Dl 1)X>p ) - £ M (/(x G Y x ) D Ll , x , p )) =0-0 = 0. (119) 
Consequently,

$$E_{\mu_0}\big(I(x \in Y_1 \setminus \hat{Y}_1)\, v_1^T D_{L_1,x,p}\, u_1\big) - E_{\mu_0}\big(I(x \in \hat{Y}_1 \setminus Y_1)\, v_1^T D_{L_1,x,p}\, u_1\big) = 0. \qquad (120)$$



Defining

$$\theta_{u_1,v_1}(x) = \arctan\left(\frac{u_1^T x}{v_1^T x}\right)$$

and

$$Y_0 = \{x \in B(0, 1) : \mathrm{dist}(x, L_1) < \min_{3 \le i \le K} \mathrm{dist}(x, L_i)\},$$

we express the regions $Y_1$ and $\hat{Y}_1$ as follows:

$$Y_1 = Y_0 \cap \{x \in B(0, 1) : \theta/2 - \pi/2 < \theta_{u_1,v_1}(x) < \theta/2\}, \qquad (121)$$

$$\hat{Y}_1 = Y_0 \cap \{x \in B(0, 1) : \hat{\theta}/2 - \pi/2 < \theta_{u_1,v_1}(x) < \hat{\theta}/2\}. \qquad (122)$$

FIG 1. The case d = 1 and K = 2.

Figure 1 demonstrates those regions and clarifies (121) and (122) in the special case
where $d = 1$ and $K = 2$.

Combining (121) and (122) with the definition of $D_{L,x,p}$ in (6), we obtain that

$$Y_1 \setminus \hat{Y}_1 \subseteq \{x \in B(0,1) : v_1^T x x^T u_1 = \mathrm{dist}(x, L_1)^{2-p}\, v_1^T D_{L_1,x,p}\, u_1 > 0\}, \qquad (123)$$

and

$$\hat{Y}_1 \setminus Y_1 \subseteq \{x \in B(0, 1) : v_1^T x x^T u_1 = \mathrm{dist}(x, L_1)^{2-p}\, v_1^T D_{L_1,x,p}\, u_1 < 0\}. \qquad (124)$$

It follows from Lemma 9.5 that $\mathcal{L}_D((Y_1 \setminus \hat{Y}_1) \cup (\hat{Y}_1 \setminus Y_1)) > 0$. This observation
combined with (120), (123) and (124) leads to a contradiction, which proves (117) and
consequently (105).

We note that the same proof can be generalized to the case $d = D - 1$, by fixing $v_1$ as
one of the two unit vectors spanning $L_1 \cap (L_1 \cap L_2)^{\perp}$ (note that $\dim(L_1) = D - 1$
and $\dim(L_1 \cap L_2) = D - 2$ so that $\dim(L_1 \cap (L_1 \cap L_2)^{\perp}) = 1$) and $u_1$ the unit vector
of $(L_1 + L_2) \cap L_1^{\perp}$ with a similar orientation as in the case where $d = 1$.

9.6.6. Concluding the Proof of (105) and Theorem 6.2 

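We conclude the proof of (105) and consequently Theorem 6.2 by considering the case
not discussed above, i.e., $d \neq 1$ and $d \neq D - 1$. We first reduce (105) to a simpler
form using the following notation. For $L, L^* \in G(D,d)$, we define the "orthogonal
subtraction" as follows:

$$L^* \ominus L = L^* \cap (L \cap L^*)^{\perp}.$$

We let

$$\Omega_1 = \{(P_1, P_2) \in \mathbb{R}^{D \times D} \times \mathbb{R}^{D \times D} : \exists\, L \in G(D,d), \text{ not orthogonal to } L_1, \text{ s.t. } \dim(L_1 \ominus L) > 1, \ P_{L_1} P_L P_{L_1} = P_1, \ P_{L_1}^{\perp} P_L P_{L_1}^{\perp} = P_2\}$$

and define the measure $\omega_1$ on $\Omega_1$ as follows: For any set $A \subseteq \Omega_1$

$$\omega_1(A) = \gamma_{D,d}\big(L \in G(D,d) : (P_{L_1} P_L P_{L_1}, P_{L_1}^{\perp} P_L P_{L_1}^{\perp}) \in A\big).$$

Using some of this notation we reduce (105) as follows:

$$\gamma_{D,d}\Big(L_2 : L_1 \text{ is not orthogonal to } L_2, \ \dim(L_1 \ominus L_2) > 1, \ \min_{1 \le i \neq j \le K} \theta_{d^*}(L_i, L_j) > 0, \ \arg\min_{2 \le i \le K} \theta_{d^*}(L_1, L_i) = 2, \ E_{\mu_0}\big(I(x \in Y_1)\, D_{L_1,x,p}\big) = 0 \ \Big|\ (P_{L_1} P_{L_2} P_{L_1}, P_{L_1}^{\perp} P_{L_2} P_{L_1}^{\perp}) = (P_1, P_2) \in \Omega_1\Big) = 0. \qquad (125)$$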
Indeed, 

7r>.d I L 2 £ G(D,d) : min d » (L if L„) > 0, arg min 6» d » (Li , L ? ) = 2, 

\ l<v£j<K 2<i<K 

^ o (7(xeY 1 )D LljX>p ) = 0) 

< / 7r>.d (L2 : Li is not orthogonal to L2, dim(Li G L 2 ) > 1, 

min ^.(L^Lj) > 0, axgrnin^L^Lj) = 2,E IMl (I(x £ Yi)D LljXjP ) = 

1<W<^ 2<i<K 

(PuPuPu,Pu T PuPu) = (Pl,P 2 ) G Xo) d(M(P 1; P 2 )) 

+ 7B,d (L 2 £ G(D, d) : dim(Li 9 L 2 ) < 1, arL 2 JL L x ) = + = 0. 

We prove (105) by using the following two lemmata. 

Lemma 9.6. If $\dim(L_1 \ominus L_2) \ge 2$ and $L_1$ is not orthogonal to $L_2$, then the set

\[
Z = \{L \in G(D,d) : P_{L_1}(P_{L_2} - P_L)P_{L_1} = 0,\ P_{L_1}^{\perp}(P_{L_2} - P_L)P_{L_1}^{\perp} = 0\}
\]

is infinite.

Lemma 9.7. 7/L 2 ,L 2 £ G(D,d) satisfy L 2 £ L 2 , d .(L 2 ,Li) V d .( L 2,Li) 

< mina^JC^^.L!). P Ll (P £2 - P£ 2 )P Ll = 0, and P^{P U - P L JP^ = 0, 
then either L 2 or L 2 will not satisfy the condition in (105). 

Lemma 9.6 implies that there are infinitely many subspaces $L \in G(D,d)$ satisfying $P_{L_1} P_L P_{L_1} = P_1$
and $P_{L_1}^{\perp} P_L P_{L_1}^{\perp} = P_2$. On the other hand, Lemma 9.7 implies that only one subspace
out of these infinitely many subspaces satisfies the underlying condition of (125), and thus
the latter equation is proved.

We conclude the proof by verifying Lemmata 9.6 and 9.7.

Proof of Lemma 9.6. We denote $\tilde{L}_1 = L_1 \ominus (L_1 \cap L_2)$ and $\tilde{L}_2 = L_2 \ominus (L_1 \cap L_2)$. The
idea of the proof is to construct a one-to-one function $g : S^{D-1} \cap \tilde{L}_2 \to Z$. Using this
function and the fact that $\dim(\tilde{L}_2) = \dim(L_1) - \dim(L_2 \cap L_1) \ge 2$ we conclude that
$Z$, which contains $g(S^{D-1} \cap \tilde{L}_2)$, is infinite.

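For any $u_0 \in S^{D-1} \cap \tilde{L}_2$, we arbitrarily fix $v_0 = v_0(u_0)$ as one of the two unit
vectors spanning $L_1 \cap (L_2 \ominus \mathrm{Sp}(u_0))^{\perp}$. The vector $v_0$ exists since

\[
\dim\big(L_1 \cap (L_2 \ominus \mathrm{Sp}(u_0))^{\perp}\big) \ge \dim(L_1) + \dim\big((L_2 \ominus \mathrm{Sp}(u_0))^{\perp}\big) - D
= d + (D - d + 1) - D = 1.
\]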

We define the function $g$ as follows:

\[
g(u_0) = \mathrm{Sp}\big(u_0 - 2(v_0^T u_0)v_0,\ L_2 \ominus \mathrm{Sp}(u_0)\big).
\]
We first claim that the image of $g$ is contained in $Z$. Indeed, we note that

\[
P_{g(u_0)} - P_{L_2} = \big(u_0 - 2(v_0^T u_0)v_0\big)\big(u_0 - 2(v_0^T u_0)v_0\big)^T - u_0 u_0^T
= -2(v_0^T u_0)\Big(v_0\big(u_0 - (v_0^T u_0)v_0\big)^T + \big(u_0 - (v_0^T u_0)v_0\big)v_0^T\Big). \tag{126}
\]

Combining (126) with the following two facts: $v_0 \in L_1$ and $u_0 - (v_0^T u_0)v_0 \in L_1^{\perp}$, we
obtain that $g(u_0) \in Z$.
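
The algebraic step in (126) is the expansion of a difference of two rank-one matrices, where one vector is the reflection of the other. A minimal numerical sketch follows, assuming column vectors and an arbitrary pair of unit vectors in R^3 chosen purely for illustration; the helper names `dot`, `outer` and `identity_126_gap` are hypothetical and not from the paper.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]

def identity_126_gap(u, v):
    """Largest entrywise gap between w w^T - u u^T and
    -2c (v r^T + r v^T), where c = v.u, w = u - 2c v and r = u - c v."""
    c = dot(v, u)
    w = [ui - 2 * c * vi for ui, vi in zip(u, v)]  # reflection of u when v is a unit vector
    r = [ui - c * vi for ui, vi in zip(u, v)]      # u minus its component along v
    ww, uu = outer(w, w), outer(u, u)
    vr, rv = outer(v, r), outer(r, v)
    n = len(u)
    return max(
        abs((ww[i][j] - uu[i][j]) + 2 * c * (vr[i][j] + rv[i][j]))
        for i in range(n) for j in range(n)
    )

u0 = [0.6, 0.8, 0.0]                                # illustrative unit vector
v0 = [1 / math.sqrt(2), 0.0, 1 / math.sqrt(2)]      # illustrative unit vector
gap = identity_126_gap(u0, v0)
```

The gap vanishes up to rounding for any choice of the two vectors, since the identity is purely algebraic.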

At last, we prove that $g$ is one-to-one and thus conclude the proof. If on the contrary
there exist $u_1, u_2 \in S^{D-1} \cap \tilde{L}_2$ such that $u_1 \neq u_2$ and $g(u_1) = g(u_2)$,
then $g(u_1) = \mathrm{Sp}(g(u_1), g(u_2)) \supseteq (L_2 \ominus \mathrm{Sp}(u_1)) + (L_2 \ominus \mathrm{Sp}(u_2)) \supseteq L_2$. Since
$\dim(g(u_1)) = \dim(L_2)$ we conclude that $g(u_1) = L_2$. On the other hand, we claim
that for any $u_0 \in S^{D-1} \cap \tilde{L}_2$, $g(u_0) \neq L_2$, which leads to a contradiction. Indeed, since
$u_0 \in \tilde{L}_2$, $v_0 \in L_1$ and $L_1$ is not orthogonal to $L_2$, we have that $v_0^T u_0 \neq 0$ and consequently
$u_0 - (v_0^T u_0)v_0 \neq u_0$. Applying the latter observation in (126), we obtain that
$P_{g(u_0)} \neq P_{L_2}$ and consequently $g(u_0) \neq L_2$.

□

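Proof of Lemma 9.7. We assume on the contrary that both $L_2$ and $\hat{L}_2$ satisfy the underlying
condition of (105) and derive a contradiction. We use the notation $Y_1$ and $\hat{Y}_1$
of (118).

We arbitrarily fix here $x \in Y_1 \setminus \hat{Y}_1$. We note that $\operatorname{dist}(x, L_1) < \operatorname{dist}(x, L_2)$ and
$\operatorname{dist}(x, L_1) < \min_{3 \le i \le K} \operatorname{dist}(x, L_i)$. Since $x \notin \hat{Y}_1$, $\operatorname{dist}(x, L_1) > \operatorname{dist}(x, \hat{L}_2)$ and
thus

\[
\operatorname{dist}(x, \hat{L}_2) < \operatorname{dist}(x, L_1) < \operatorname{dist}(x, L_2). \tag{127}
\]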

Consequently,

\[
x^T (P_{L_2} - P_{\hat{L}_2}) x = \operatorname{dist}(x, \hat{L}_2)^2 - \operatorname{dist}(x, L_2)^2 < 0. \tag{128}
\]
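
Equation (128) rests on the identity x^T P_L x = ||x||^2 - dist(x, L)^2 for an orthogonal projector P_L, so that a difference of two projectors turns into a difference of squared distances with the roles swapped. The following sketch checks this for two planes in R^3; the subspaces, the test point and the helper names (`project`, `dist`, `quad_form_gap`) are assumptions made only for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project(x, basis):
    """Orthogonal projection of x onto span(basis); basis must be orthonormal."""
    out = [0.0] * len(x)
    for q in basis:
        t = dot(x, q)
        out = [o + t * qi for o, qi in zip(out, q)]
    return out

def dist(x, basis):
    """Euclidean distance from x to span(basis)."""
    p = project(x, basis)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def quad_form_gap(x, basis_a, basis_b):
    """|x^T (P_A - P_B) x - (dist(x,B)^2 - dist(x,A)^2)|."""
    lhs = dot(x, project(x, basis_a)) - dot(x, project(x, basis_b))
    rhs = dist(x, basis_b) ** 2 - dist(x, basis_a) ** 2
    return abs(lhs - rhs)

t = 0.4  # illustrative angle between the two planes
plane_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
plane_b = [[1.0, 0.0, 0.0], [0.0, math.cos(t), math.sin(t)]]
x = [0.3, -0.5, 0.4]
gap = quad_form_gap(x, plane_a, plane_b)
```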

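We partition $P_{L_2} - P_{\hat{L}_2}$ into four parts: $P_{L_1}(P_{L_2} - P_{\hat{L}_2})P_{L_1}$, $P_{L_1}^{\perp}(P_{L_2} - P_{\hat{L}_2})P_{L_1}^{\perp}$,
$P_{L_1}(P_{L_2} - P_{\hat{L}_2})P_{L_1}^{\perp}$ and $P_{L_1}^{\perp}(P_{L_2} - P_{\hat{L}_2})P_{L_1}$. The first two are zero, and the last two
are adjoint to each other; we thus only consider $P_{L_1}(P_{L_2} - P_{\hat{L}_2})P_{L_1}^{\perp}$. Let its SVD be

\[
P_{L_1}(P_{L_2} - P_{\hat{L}_2})P_{L_1}^{\perp} = U \Sigma V^T = \sum_{i=1}^{d} \sigma_i u_i v_i^T. \tag{129}
\]

We can express $P_{L_2} - P_{\hat{L}_2}$ using (129) and the partition mentioned above
as follows:

\[
P_{L_2} - P_{\hat{L}_2} = \sum_{i=1}^{d} \sigma_i \big(u_i v_i^T + v_i u_i^T\big). \tag{130}
\]

Combining (128) and (130), we obtain that

\[
\sum_{i=1}^{d} \sigma_i u_i^T x x^T v_i = x^T \Big(\sum_{i=1}^{d} \sigma_i \big(u_i v_i^T + v_i u_i^T\big)\Big) x / 2 < 0. \tag{131}
\]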
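
The equality in (131) says that a sum of weighted cross terms (u_i . x)(x . v_i) is half of the quadratic form of the symmetrized rank-one sum. A small numerical sketch follows, with made-up weights and orthonormal vectors in R^4 standing in for the singular values and singular vectors; the function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def weighted_cross_sum(x, sigmas, us, vs):
    """sum_i sigma_i (u_i . x)(x . v_i)."""
    return sum(s * dot(u, x) * dot(x, v) for s, u, v in zip(sigmas, us, vs))

def half_quadratic_form(x, sigmas, us, vs):
    """x^T M x / 2 with M = sum_i sigma_i (u_i v_i^T + v_i u_i^T)."""
    n = len(x)
    m = [
        [sum(s * (u[i] * v[j] + v[i] * u[j]) for s, u, v in zip(sigmas, us, vs))
         for j in range(n)]
        for i in range(n)
    ]
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n)) / 2

# Illustrative data: u_i span the first two coordinates, v_i the last two.
sigmas = [0.9, 0.4]
us = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
vs = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
x = [0.2, -0.7, 0.5, 0.1]
left = weighted_cross_sum(x, sigmas, us, vs)
right = half_quadratic_form(x, sigmas, us, vs)
```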

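We define a function $f : \mathbb{R}^{D \times D} \to \mathbb{R}$ such that for any $A \in \mathbb{R}^{D \times D}$: $f(A) =
\sum_{i=1}^{d} \sigma_i u_i^T A v_i$. Using (131) and the facts that $\{u_i\}_{i=1}^{d} \subset L_1$ and $\{v_i\}_{i=1}^{d} \subset L_1^{\perp}$, we
have

\[
f(D_{L_1,x,p}) = \operatorname{dist}(x, L_1)^{(p-2)} f\big(P_{L_1}(x) P_{L_1}^{\perp}(x)^T\big)
= \operatorname{dist}(x, L_1)^{(p-2)} \sum_{i=1}^{d} \sigma_i u_i^T P_{L_1}(x) P_{L_1}^{\perp}(x)^T v_i
= \operatorname{dist}(x, L_1)^{(p-2)} \sum_{i=1}^{d} \sigma_i u_i^T x x^T v_i < 0. \tag{132}
\]

Similarly, for any point $x \in \hat{Y}_1 \setminus Y_1$:

\[
f(D_{L_1,x,p}) > 0. \tag{133}
\]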

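Combining (119), (132), (133), Lemma 9.5 and the linearity of $f$ we conclude the
following contradiction establishing the current lemma:

\[
0 = f\Big(E_{\mu_0}\big(I(x \in \hat{Y}_1 \setminus Y_1)\, D_{L_1,x,p}\big) - E_{\mu_0}\big(I(x \in Y_1 \setminus \hat{Y}_1)\, D_{L_1,x,p}\big)\Big)
= f\Big(E_{\mu_0}\big(I(x \in \hat{Y}_1 \setminus Y_1)\, D_{L_1,x,p}\big)\Big) - f\Big(E_{\mu_0}\big(I(x \in Y_1 \setminus \hat{Y}_1)\, D_{L_1,x,p}\big)\Big) > 0. \tag{134}
\]

□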
Appendix A: Supplementary Details
A.1. Proof of Lemma 9.1

We will use the following inequality, which we verify below in Section A.1.1:



Mi (x G B(0,l)nLi : dist(x,Li) < /3dist(Li, Li) ) < (3d? V/3 > 0. (135) 



We denote fo = XflKdfi and apply (135) to obtain that 




for any 1 < % < K. 



Consequently, we derive the following estimate 



Hi x G B(0, 1) n Li : min dist(x, Lj) > foe 




i=i 



and thus by Chebyshev's inequality the lemma is concluded as follows: 



(e ip (x,L l7 L 2 ,... ,1k)) > 



□ 
A.1.1. Proof of (135)

We denote the principal angles between $L_1$ and $\hat{L}_1$ by $\{\theta_i\}_{i=1}^{d}$, the principal vectors
of $L_1$ and $\hat{L}_1$ by $\{v_i\}_{i=1}^{d}$ and $\{\hat{v}_i\}_{i=1}^{d}$ respectively, the interaction dimension by $k =
k(L_1, \hat{L}_1)$ (see Section 2.2), the volume of the $d$-dimensional unit ball by $v_d$ and

\[
\gamma_i = \frac{\sin(\theta_i)^2}{\sum_{j=1}^{d} \sin(\theta_j)^2}, \quad i = 1, \ldots, d.
\]



Expressing the points in Li by their coordinates with respect to {vj}f =1 , we obtain that 
|x g B(0, 1) n Li : dist(x,Li) < £dist(Li,Li)} 



= |x= (xi,x 2 ,--- ,Xd) G B(0,l)nLi 
C < x = (a;i,x 2 ,-- - ,»<i) G B(0,l)nLi 



\ X] s ? sin2 ^ < ^ ' 
\ i=i 



\ i=i 



\ 51 sin2 6,8 



= < x = (zi,x 2 , ■ 



,.T d ) G B(0,l)nL! : , J2^ iX i < f' 3 



C <^ x = (xi,a;2, • ■ • GB(0, l)nLi : |xi| < 



2^ 



Since $\sum_{i=1}^{d} \gamma_i = 1$, WLOG we assume that $\gamma_1 \ge 1/k \ge 1/d$ and consequently get
that

|xGB(0,l)DLi :dist(x,Li) < j3dist(Li,Li)} 

c|x= (xi.aa,-- - ,*,*) e B(0,l)nLi : < ^p8, |a> 2 | < ^Y^A ^ x | ■ 
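Therefore

\[
\operatorname{Vol}\{x : x \in B(0,1) \cap L_1,\ \operatorname{dist}(x, \hat{L}_1) < \beta \operatorname{dist}(L_1, \hat{L}_1)\} < 2\pi v_{d-2} \sqrt{d}\, \beta. \tag{136}
\]

Combining (136) with the immediate observation $v_d = \frac{2\pi}{d} v_{d-2}$, we conclude (135)
as follows:

\[
\mu_1\{x \in B(0,1) \cap L_1 : \operatorname{dist}(x, \hat{L}_1) < \beta \operatorname{dist}(L_1, \hat{L}_1)\}
= \frac{\operatorname{Vol}\{x \in B(0,1) \cap L_1 : \operatorname{dist}(x, \hat{L}_1) < \beta \operatorname{dist}(L_1, \hat{L}_1)\}}{\operatorname{Vol}\{x \in B(0,1) \cap L_1\}}
< \frac{2\pi v_{d-2} \sqrt{d}\, \beta}{v_d} = \beta d^{3/2}.
\]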



□ 
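
The passage from (136) to (135) uses the standard recursion between the volumes of unit balls whose dimensions differ by two, v_d = (2*pi/d) * v_{d-2}. The sketch below checks it against the closed form v_d = pi^(d/2) / Gamma(d/2 + 1); the function names are illustrative and not part of the paper.

```python
import math

def unit_ball_volume(d):
    """Volume of the d-dimensional Euclidean unit ball (closed form via Gamma)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def recursion_gap(d):
    """|v_d - (2*pi/d) * v_{d-2}| for d >= 2."""
    return abs(unit_ball_volume(d) - 2 * math.pi / d * unit_ball_volume(d - 2))

gaps = [recursion_gap(d) for d in range(2, 20)]
```

With this recursion, a bound of the form 2*pi*v_{d-2}*c on a volume inside the unit ball becomes d*c after dividing by v_d.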
A.2. Proof of Lemma 9.2

We denote the principal angles between the $d$-subspaces $L_1$, $L_2$ by $\theta_1 \ge \theta_2 \ge \theta_3 \ge
\cdots \ge \theta_d$. Arbitrarily choosing $Q_1, Q_2 \in O(D,d)$, representing $L_1$, $L_2$ respectively,
we note that

\[
|\operatorname{dist}(x, L_1) - \operatorname{dist}(x, L_2)| = \big|\, \|x - x Q_1 Q_1^T\| - \|x - x Q_2 Q_2^T\| \,\big|
\le \|x - x Q_1 Q_1^T - x + x Q_2 Q_2^T\| \le \|x\|\, \|Q_1 Q_1^T - Q_2 Q_2^T\|_F
\]

d 

= ||x||di8t(Li,L 2 ). 

i=l 

□ 
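
The first two steps of the chain in this proof (the reverse triangle inequality followed by the bound through the Frobenius norm of the difference of projectors) can be sanity-checked by random sampling. The sketch below is an illustration only: it assumes d = 1 and D = 3, so that each subspace is a line spanned by a unit vector q and the projector is q q^T, and all function names are hypothetical. It does not test the final comparison with dist(L_1, L_2).

```python
import math
import random

def unit_vector(rng, n=3):
    """Random unit vector obtained by normalizing a Gaussian sample."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]

def dist_to_line(x, q):
    """Distance from x to the line spanned by the unit vector q."""
    t = sum(xi * qi for xi, qi in zip(x, q))
    return math.sqrt(sum((xi - t * qi) ** 2 for xi, qi in zip(x, q)))

def projector_frobenius_distance(q1, q2):
    """Frobenius norm of q1 q1^T - q2 q2^T."""
    return math.sqrt(sum(
        (a * b - c * d) ** 2
        for a, c in zip(q1, q2)      # row index: entries q1[i], q2[i]
        for b, d in zip(q1, q2)      # column index: entries q1[j], q2[j]
    ))

def lipschitz_violations(trials=2000, seed=0):
    """Count samples violating |dist(x,L1) - dist(x,L2)| <= ||x|| ||P1 - P2||_F."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        q1, q2 = unit_vector(rng), unit_vector(rng)
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        lhs = abs(dist_to_line(x, q1) - dist_to_line(x, q2))
        rhs = math.sqrt(sum(t * t for t in x)) * projector_frobenius_distance(q1, q2)
        if lhs > rhs + 1e-12:
            bad += 1
    return bad

violations = lipschitz_violations()
```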



\ 



sin 



l) 2 < IM 



\ 



A.3. Proof of Lemma 9.3

We assume WLOG that $i = 1$ in (18). We thus need to prove that for all $L \in G(D,d)$:

\[
E(\operatorname{dist}(x_1, L)^p) + E(\operatorname{dist}(x_2, L)^p) \ge E(\operatorname{dist}(x_1, L_1)^p) + E(\operatorname{dist}(x_2, L_1)^p). \tag{137}
\]

We denote the principal angles between $L_1$ and $L_2$ by $\{\theta_i\}_{i=1}^{d}$, the principal vectors of
$L_1$ and $L_2$ by $\{\hat{v}_i\}_{i=1}^{d}$ and $\{v_i\}_{i=1}^{d}$ and the complementary orthogonal system for $L_2$
w.r.t. $L_1$ by $\{u_i\}_{i=1}^{d}$.

We notice that we can restrict the set of subspaces $L$ for which (137) needs to be verified. First of all,
we only need to consider subspaces

\[
L \subseteq L_1 + L_2. \tag{138}
\]

Indeed, the LHS of (137) is the same if we replace $L$ by $L \cap (L_1 + L_2)$.
Second of all, we claim that it is sufficient to assume that

\[
\mathrm{Sp}(v_i, \hat{v}_i) \not\subseteq L \quad \text{for all } 1 \le i \le k. \tag{139}
\]

Indeed, WLOG let $i = 1$ and suppose on the contrary to (139) that $v_1, \hat{v}_1 \in L$. Since
$L$ is $d$-dimensional, there exists $2 \le j \le d$ (assume WLOG $j = 2$) such that it
does not contain both $v_j$ and $\hat{v}_j$. For any pair of points $x = \sum_{i=1}^{d} a_i \hat{v}_i \in L_1$ and
$\hat{x} = \sum_{i=1}^{d} a_i v_i \in L_2$:



dist(x, L) = y sin(6' 2 ) 2 a| + 1/^ and dist(x, L) = ysin(6'i) 2 af + 

where 

( d -\ ( d -\ 

v\ = dist I a,Vj, L ) and ^ 2 = dist I a^v;, L ) . 

Now, for L = Sp(L \ {vx, Vx}, Vx, v 2 ), we obtain that 



dist(x,L) = Wsin(6*i) 2 a 2 +sin(6' 2 ) 2 a2 + v\ and dist(x, L) = V\. 
Therefore

\[
\operatorname{dist}(x, \tilde{L})^p + \operatorname{dist}(\hat{x}, \tilde{L})^p \le \operatorname{dist}(x, L)^p + \operatorname{dist}(\hat{x}, L)^p
\]
and by direct integration we have that

\[
E(\operatorname{dist}(x_1, \tilde{L})^p) + E(\operatorname{dist}(x_2, \tilde{L})^p) \le E(\operatorname{dist}(x_1, L)^p) + E(\operatorname{dist}(x_2, L)^p).
\]

We can thus replace the subspace $L$ with the subspace $\tilde{L}$, which satisfies (139) (for
$i = 1$, but can similarly be changed for all $1 \le i \le k$).

It follows from (138) and (139) that L can be represented as follows: 



L = Sp(vJ,v 



where 



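Thus, for any pair of points $x = \sum_{i=1}^{d} a_i \hat{v}_i \in L_1$ and $\hat{x} = \sum_{i=1}^{d} a_i v_i \in L_2$:

\[
\operatorname{dist}(x, L) = \sqrt{\sum_{i=1}^{d} \sin^2(\theta_i^*)\, a_i^2} \quad \text{and} \quad \operatorname{dist}(\hat{x}, L) = \sqrt{\sum_{i=1}^{d} \sin^2(\theta_i - \theta_i^*)\, a_i^2} \tag{140}
\]

and

\[
\operatorname{dist}(x, L_1) = 0 \quad \text{and} \quad \operatorname{dist}(\hat{x}, L_1) = \sqrt{\sum_{i=1}^{d} \sin^2(\theta_i)\, a_i^2}. \tag{141}
\]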

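Combining (140), (141), the triangle inequality (for "sine vectors" in $\mathbb{R}^d$) and the
subadditivity of the sine function, we conclude that

\[
\operatorname{dist}(x, L) + \operatorname{dist}(\hat{x}, L) \ge \sqrt{\sum_{i=1}^{d} \big(\sin\theta_i^* + \sin(\theta_i - \theta_i^*)\big)^2 a_i^2}
\ge \sqrt{\sum_{i=1}^{d} \sin^2(\theta_i)\, a_i^2} = \operatorname{dist}(x, L_1) + \operatorname{dist}(\hat{x}, L_1).
\]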



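Since $p \le 1$, this inequality extends to

\[
\operatorname{dist}(x, L)^p + \operatorname{dist}(\hat{x}, L)^p \ge \operatorname{dist}(\hat{x}, L_1)^p = \operatorname{dist}(x, L_1)^p + \operatorname{dist}(\hat{x}, L_1)^p. \tag{142}
\]

Integrating (142) w.r.t. the uniform distribution we conclude (137) and thus prove the
lemma. □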
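
The last two steps of this proof rely on two elementary inequalities: subadditivity of the sine on [0, pi/2], i.e. sin(a) + sin(t - a) >= sin(t) for 0 <= a <= t <= pi/2, and subadditivity of s -> s^p for 0 < p <= 1, i.e. a^p + b^p >= (a + b)^p for a, b >= 0. The grid check below is a sketch only; the grid sizes, tolerance and function names are arbitrary choices for illustration.

```python
import math

def sine_subadditivity_holds(steps=60):
    """Check sin(a) + sin(t - a) >= sin(t) on a grid with 0 <= a <= t <= pi/2."""
    grid = [math.pi / 2 * k / steps for k in range(steps + 1)]
    return all(
        math.sin(a) + math.sin(t - a) >= math.sin(t) - 1e-12
        for t in grid for a in grid if a <= t
    )

def power_subadditivity_holds(p, steps=40):
    """Check a**p + b**p >= (a + b)**p on a grid in [0, 1]^2."""
    grid = [k / steps for k in range(steps + 1)]
    return all(
        a ** p + b ** p >= (a + b) ** p - 1e-12
        for a in grid for b in grid
    )

sine_ok = sine_subadditivity_holds()
power_ok = power_subadditivity_holds(0.5)
power_fails_for_p2 = not power_subadditivity_holds(2.0)
```

The failure for p = 2 (for example a = b = 1/2) illustrates why the step leading to (142) needs p <= 1.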



A.4. Proof of Lemma 9.4 

Assume first that $(I(1), \ldots, I(K))$ is a permutation of $(1, \ldots, K)$; then $I$ has an
inverse function, $I^{-1}$. We define



M = argmax 1 < J < A -dist(L i ,L /(j) ) 
and note that

i min^dist(L M ,L J ) = dist(L M , L /(M) ) (143) 
= dist((Li,L 2 , • • • , Lrt), (L j(1 ), L j( 2), • • • ,Li(k))) > do- 
Combining (143) with Lemma 9.1 we obtain that 

E nu e i P ( x ^> L 2,--- ,Lk) - J5 MM ej p (x,Li,L 2) • • • , Lif ) (144) 

For any x £ <Yo, let m(x) = argmin 1<i<A -dist(x, Lj), m(x) = argmin 1<i< j t - 
dist(x, L,-) and note that 

e; p (x,Li,L 2 , ■ ■ • Xk) ~ e /p (x,Li,L 2 , • • • ,L K ) = dist(x, L A(x) ) p (145) 
- dist(x, L m(x) ) p > dist(x, L A(x) ) p - dist(x, Lj-i (rfl ( x ))) p 

> — 1 1 x 1 1 p dist (L a(x) j L/- 1 (a(x) ) ) p > -Hxf^ > -<%, 
where the second inequality in (145) uses Lemma 9.2. Therefore, 

E^ ei p (x,Li,~L2, ■ ■ ■ Xk) - S Mo e Zp (x,Li,L 2 , • • • ,L K ) > — <ig. (146) 
At last, we observe that 

£^ AI e; p (x, Li,L 2 , ■ ■ • ,L K ) - S /J e Zp (x, Li, L 2 , • ■ ■ ,Ljf) 

> «m ^ AlM e /p (x,Li,L 2 , • • • ,Lk) - ^M e Jp( x > L i> L 2, • • • ,Lif)) 

+ "o f^Mo e i P ( x iLi,L2, • • ■ ,Ljf) - S Mo e Zp (x,Li,L 2 , • ■ ■ ,L K )J . (147) 

Combining (144), (146) and (147), the lemma is proved in this case. 

Next, we assume that $I(1), \ldots, I(K)$ is not a permutation of $1, 2, \ldots, K$, where
we use some of the notation introduced above. In this case, there exist distinct indices $1 \le n_1, n_2 \le K$
such that $I(n_1) = I(n_2)$ and consequently

2 i min^dist(Lji/,L i ) = 2dist(L M , L 7(M) ) > dist(L ni , L /(tll) ) + dist(L„ 2 , L 7( „ 2) ) 

> dist(L„ l; L„,) > min dist(Lj, LA (148) 

l<i,j<K 
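
The existence of two distinct indices with the same image is the pigeonhole principle: a map from {1, ..., K} to itself that is not a permutation cannot be injective. A minimal sketch follows, in which the assignment I is represented by a Python list whose entry at position n - 1 is I(n); the function name `find_collision` is hypothetical.

```python
def find_collision(assignment):
    """Return a pair (n1, n2) of distinct 1-based indices with equal images,
    or None if the assignment is injective."""
    first_seen = {}
    for n, target in enumerate(assignment, start=1):
        if target in first_seen:
            return first_seen[target], n
        first_seen[target] = n
    return None

collision = find_collision([2, 3, 2])      # I(1) = 2, I(2) = 3, I(3) = 2
no_collision = find_collision([2, 1, 3])   # a permutation of (1, 2, 3)
```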

Combining (148) and Lemma 9.1 (applied with e = mmi<;j</< dist(L.;, Lj)/2), we 
obtain that 

E^e^ixXi,^,--- Xk) - E^ M ei p (x,L ll L 2 , ■ ■ ■ ,L K ) (149) 
> r ( min dist(L,,L :) )/2) p . 

l<i,j<K 

Using the above notation for m(x) and m(x) we get that for any x £ Xq\ 

e; (x,Li,L 2 , • • • ,Lk) - ej (x,Li,L 2) • • • ,L K ) (150) 
= dist(x,L A(x) ) - dist(x,L m(x) ) > -1

and consequently 

£ , , I0 e; p (x,Li,L 2 , • • ■ Xk) - £^ e/ p (x,Li,L 2 , ■ ■ ■ ,Ljf) > -1. (151) 
The lemma is concluded by combining (147), (149) and (151).

□ 



A.5. Proof of (39) 

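The fact that $E_{\mu_1}\big(P_{L_1}(x) P_{L_1}(x)^T\big)$ is a scalar matrix follows from the uniformity of
$\mu_1$ on $L_1 \cap B(0,1)$. We compute the underlying scalar, $\delta_*$, as follows. We arbitrarily
fix a unit vector $v \in L_1$ as well as a $(d-1)$-subspace $\tilde{L}_1 \subset L_1$ orthogonal to $v$ and observe
that

\[
\delta_* = E_{\mu_1}\big((P_{L_1}(x)^T v)^2\big) = E_{\mu_1}\big(\operatorname{dist}(x, \tilde{L}_1)^2\big).
\]

We further note that for any $0 \le r \le 1$, the set $\{x \in B(0,1) \cap L_1 : \operatorname{dist}(x, \tilde{L}_1) = r\}$
consists of two $(d-1)$-dimensional balls of radius $\sqrt{1 - r^2}$. We consequently compute
the constant $\delta_*$ using the beta function $B$ and the Gamma function $\Gamma$ in the following
way:

\[
\delta_* = E_{\mu_1} \operatorname{dist}^2(x, \tilde{L}_1)
= \frac{\int_0^1 r^2 (1 - r^2)^{\frac{d-1}{2}}\, dr}{\int_0^1 (1 - r^2)^{\frac{d-1}{2}}\, dr}
= \frac{\int_0^{\pi/2} \sin^2(\theta) \cos^{d}(\theta)\, d\theta}{\int_0^{\pi/2} \cos^{d}(\theta)\, d\theta}
= \frac{B\big(\tfrac{3}{2}, \tfrac{d+1}{2}\big)}{B\big(\tfrac{1}{2}, \tfrac{d+1}{2}\big)}
= \frac{\Gamma\big(\tfrac{3}{2}\big)\, \Gamma\big(\tfrac{d+1}{2}\big)\, \Gamma\big(\tfrac{d+2}{2}\big)}{\Gamma\big(\tfrac{1}{2}\big)\, \Gamma\big(\tfrac{d+1}{2}\big)\, \Gamma\big(\tfrac{d+4}{2}\big)}
= \frac{1}{d+2}.
\]

□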
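
The value of the scalar can be cross-checked numerically. The sketch below assumes, as described in the proof, that the slice at distance r has weight proportional to (1 - r^2)^((d-1)/2); it evaluates the ratio of beta functions through `math.gamma` and compares it with a midpoint-rule quadrature and with 1/(d + 2). The function names are illustrative and not from the paper.

```python
import math

def beta(a, b):
    """Euler beta function expressed through the Gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def delta_star_closed_form(d):
    """Ratio B(3/2, (d+1)/2) / B(1/2, (d+1)/2)."""
    return beta(1.5, (d + 1) / 2) / beta(0.5, (d + 1) / 2)

def delta_star_quadrature(d, n=20000):
    """Midpoint-rule estimate of the weighted mean of r^2 on [0, 1]
    with weight (1 - r^2)^((d-1)/2)."""
    h = 1.0 / n
    num = den = 0.0
    for k in range(n):
        r = (k + 0.5) * h
        w = (1.0 - r * r) ** ((d - 1) / 2)
        num += r * r * w
        den += w
    return num / den

closed = [delta_star_closed_form(d) for d in range(1, 12)]
estimate_d3 = delta_star_quadrature(3)
```

For example, d = 3 gives 1/5, which matches the second moment of one coordinate of a uniform point in the three-dimensional unit ball.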



A.6. Proof of (40) 

For simplicity we denote B = Y^i=i ( x 0^l 1 (xi) T . We note that if maxj at (B— 
(J* Id) < 77, then 

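\[
\left| \frac{u^T B u}{\|u\|^2} - \delta_* \right| < \eta \quad \text{for all } u \in \mathbb{R}^d \setminus \{0\},
\]

and consequently

\[
\delta_* - \eta < \frac{u^T B u}{\|u\|^2} \quad \text{for all } u \in \mathbb{R}^d \setminus \{0\},
\]
that is, $\min_t \sigma_t(B) > \delta_* - \eta$. □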
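
The implication used here, that closeness of B to a multiple of the identity in spectral norm gives a lower bound on the smallest singular value of B, can be illustrated on a symmetric 2 x 2 matrix, where the eigenvalues have a closed form. This is a sketch under the assumption that B is symmetric positive semidefinite (so its singular values coincide with its eigenvalues); the matrix entries and function names are made up for illustration.

```python
import math

def sym2_eigenvalues(a, b, c):
    """Eigenvalues (smallest first) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def check_bound(a, b, c, delta, eta):
    """Return None if max_t sigma_t(B - delta*I) >= eta (hypothesis not met);
    otherwise return whether the smallest eigenvalue of B exceeds delta - eta."""
    low, high = sym2_eigenvalues(a - delta, b, c - delta)
    if max(abs(low), abs(high)) >= eta:
        return None
    return sym2_eigenvalues(a, b, c)[0] > delta - eta

# Illustrative B close to 0.25 * I.
result = check_bound(0.26, 0.01, 0.24, delta=0.25, eta=0.02)
hypothesis_fails = check_bound(0.26, 0.01, 0.24, delta=0.25, eta=0.01)
```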